[Cherry-pick] PRs #1648 #1650 #1594 #1269 #1326 #1652 #1651 #1601 #1653 #1558 #1670 #1662 #1677 #1327 #1673 #1676 #1687 #1678 #1691 #1697 #1702 #1704 #1726 #1729 #1734
Disable Codecov binary validation, which has been failing consistently with:
```
gpg: Signature made Tue Apr 21 19:28:03 2026 UTC
gpg: using RSA key 27034E7FDB850E0BBC2C62FF806BB28AED779869
gpg: Can't check signature: No public key
==> Could not verify signature. Please contact Codecov if problem continues
Exiting...
```
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Chores**
* Updated CI workflow notes and removed an outdated header comment.
* Added explanatory comments to the Linux job and adjusted the code
coverage upload step to use a relaxed validation mode (no other upload
settings changed).
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?
Type of change: Bug fix
`tests/examples/llm_eval/test_llm_eval.py::test_qwen3_eval_fp8` was silently passing while its evals crashed, then began failing as a timeout. This repairs the whole pipeline:
- **lm_eval `IndexError` (root cause):** TRT-LLM KV-cache prefix reuse returns truncated `context_logits` for shared-prefix requests (e.g. hellaswag's one-context / many-endings), which breaks `parse_logprobs`. Add an `enable_kv_cache_reuse` flag to `modelopt.deploy.llm.LLM` (default `True`, unchanged) and disable it for the eval deployment so full-length context logits are returned.
- **Silent CI green:** `python eval.py | tee result.txt` returns `tee`'s exit code, so a crashing eval was masked. Add `set -o pipefail` to `huggingface_example.sh` so failures fail the test.
- **Long-prompt overflows:** with the tiny test model's toy tokenizer, gsm8k/MMLU prompts exceed `max_seq_len`. Bump test `max_position_embeddings` to 8192, skip MMLU prompts that don't fit even at zero-shot, and add an MMLU sample limit (`--mmlu_limit`).
- **human-eval build failures:** install with `--no-build-isolation` (`pkg_resources` is absent in pip's isolated build env), patch its malformed `console_scripts` entry point, and pin the clone.
- **Cleanups:** gate the post-quant `run_tensorrt_llm.py` smoke test behind the `quant` task (eval tasks deploy on their own; ~45s saved for eval-only runs); replace the SIGPIPE-prone serve-readiness `tail -f | while` with a poll loop (required under `pipefail`).
### Usage
N/A: example/test fix.
### Testing
All four eval tasks verified end-to-end in the CI container (TRT-LLM 1.3.0rc17, RTX 6000 Ada): lm_eval (hellaswag + gsm8k), MMLU, and simple_eval (humaneval) all complete with exit 0 and no `IndexError`/overflow. Cold full run ≈ 340s on this GPU.
CI test on 2-gpu: https://github.com/NVIDIA/Model-Optimizer/actions/runs/27154417497/job/80153551154
### Before your PR is "*Ready for review*"
- Is this change backward compatible?: ✅ (new `enable_kv_cache_reuse` defaults to current behavior; new script flags are optional)
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A (no new dependencies)
- Did you write any new necessary tests?: N/A (fixes and strengthens an existing test)
- Did you update Changelog?: N/A (bug fix to examples/tests)
- Did you get Claude approval on this PR?: ❌ (pending)
### Additional Information
The full test runs ~340s on an RTX 6000 Ada; CI runners are historically slower, while `@pytest.mark.timeout` is set to 600, so it is worth watching the first CI run and bumping the timeout if it's close.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit
* **New Features**
* Added an option to limit MMLU evaluation length.
* **Bug Fixes**
* Disabled KV-cache prefix reuse for evaluations needing per-token context logits to prevent truncated/incorrect logprobs.
* Skip examples whose prompts remain too long; warn and report accuracy as NaN if all examples are skipped.
* **Chores / Scripts**
* Improved example scripts for reproducible installs, patched entry point handling, pipeline failure detection, conditional test invocation, polling-based log wait, and a new CLI flag for MMLU limits.
* **Tests**
* Increased timeout and prompt headroom; capped MMLU smoke tests for speed.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
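The exit-code masking described under **Silent CI green** in the eval fix above can be reproduced in isolation. The sketch below is illustrative only and is not part of `huggingface_example.sh`: it assumes `bash` is on `PATH`, the helper name `pipeline_exit_code` is hypothetical, and `false` stands in for a crashing `python eval.py`.

```python
import subprocess


def pipeline_exit_code(script: str) -> int:
    """Run a bash snippet and return the exit code a caller (e.g. pytest) would see."""
    return subprocess.run(["bash", "-c", script], check=False).returncode


# `false` stands in for a crashing eval; `tee` itself always succeeds,
# so without pipefail the pipeline reports tee's exit code.
masked = pipeline_exit_code("false | tee /dev/null")

# With pipefail, the pipeline reports the last non-zero exit code of any stage.
surfaced = pipeline_exit_code("set -o pipefail; false | tee /dev/null")

print(f"without pipefail: {masked}, with pipefail: {surfaced}")
```

The first call is the situation that kept CI green; the second is what a test harness sees once `set -o pipefail` is in effect.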
### What does this PR do?
Type of change: New example
<!-- Details about the change. -->
Adds an example for Alpamayo-1 quantization with ModelOpt (FP8, NVFP4, AutoQuant).
### Usage
```
python quantize.py --ckpt nvidia/Alpamayo-R1-10B --output-dir ./alpamayo-r1-fp8 --quantize fp8
```
### Testing
<!-- Mention how have you tested your change if applicable. -->
### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain why. -->
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A <!--- Mandatory -->
- Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory for new features or examples. -->
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes or backward incompatible changes. -->
- Did you get Claude approval on this PR?: ✅ / ❌ / N/A <!--- Run `/claude review`. NVIDIA org members can self-trigger for complex changes; orthogonal to CodeRabbit. -->
### Additional Information
<!-- E.g. related issue. -->
<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit
* **New Features**
* Added Alpamayo 1 vision-language-action model quantization example supporting FP8, NVFP4, and mixed-precision optimization modes
* Introduced CLI quantization tool with calibration loop and checkpoint export capabilities for both fake-quantized and real-quantized formats
* **Documentation**
* Added comprehensive guide documenting the Alpamayo quantization example, model details, and usage instructions
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Rohan Joshi <rohjoshi@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?
Type of change: New Feature <!-- Use one of the following: Bug fix, new
feature, new example, new tests, documentation. -->
Adds HuggingFace `config.json` export of skip-softmax sparse-attention
calibration for diffusion pipelines (e.g. Wan 2.2), on top of the base
skip-softmax work.
- **`_export_diffusers_checkpoint`** walks every `nn.Module` component
of a diffusers pipeline, calls `export_sparse_attention_config`, and
writes the result into that component's `config.json` under the
`sparse_attention_config` key. The sparse config lives **only** in
`config.json` — there is no standalone `sparse.yaml`.
- **`export_sparse_attention_config`** emits a `config_groups` schema
where each algorithm's parameters are nested inside its own group; only
`config_groups` and `producer` are top-level:
- skip-softmax group → `algorithm: "skip_softmax"`, `targets`, `ignore`
(layers kept dense — e.g. cross-attention + first/last blocks),
`initial_disabled_steps` (opt-in, user-set; emitted only when `> 0`),
`threshold_scale_factor` (`a * exp(b * target_sparsity)`), and
`target_sparsity`.
- N:M group → `algorithm: "sparse_softmax"` with
`sparsity_n`/`sparsity_m`, `dense_sink_tokens`, `dense_recent_tokens`
flattened into the group.
- **Deploy reader**
(`modelopt/torch/sparsity/attention_sparsity/plugins/sparse_attn_config.py`)
reads these per-group params back, keeping the export↔load round-trip
consistent.
- **Example wiring**:
`examples/diffusers/sparsity/wan22_skip_softmax.py` gains
`--export-dir`, `--skip-softmax-threshold`, and
`--initial-disabled-steps`. `--export-dir` runs
`export_hf_checkpoint(pipe, export_dir=...)` after calibration.
- Updated `CHANGELOG.rst`.
### Usage
```bash
python examples/diffusers/sparsity/wan22_skip_softmax.py \
--model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
--calibrate --target-sparsity 0.5 --calib-size 4 \
--initial-disabled-steps 5 \
--export-dir ./wan22_skip_softmax_ckpt
```
Resulting layout — a `config.json` per component, **no `sparse.yaml`**:
```
wan22_skip_softmax_ckpt/
├── transformer/config.json # carries sparse_attention_config
├── transformer_2/config.json # carries sparse_attention_config
├── vae/ … text_encoder/ … tokenizer/ … scheduler/ …
└── model_index.json
```
A representative `config.json` entry for a diffusion transformer:
```json
"sparse_attention_config": {
"config_groups": {
"group_0": {
"algorithm": "skip_softmax",
"targets": ["WanAttention"],
"ignore": ["blocks.0.attn1", "blocks.0.attn2", "…"],
"initial_disabled_steps": 5,
"threshold_scale_factor": {
"formula": "a * exp(b * target_sparsity)",
"prefill": {"a": 1443.49, "b": 4.30}
},
"target_sparsity": {"prefill": 0.5}
}
},
"producer": {"name": "modelopt", "version": "0.45.0..."}
}
```
The N:M variant adds a second group:
```json
"group_1": {
"algorithm": "sparse_softmax",
"targets": ["WanAttention"],
"sparsity_n": 2, "sparsity_m": 4,
"dense_sink_tokens": 0, "dense_recent_tokens": 64
}
```
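For readers who want to sanity-check an exported entry, the `formula` string shown in the skip-softmax group can be evaluated directly. This is a minimal sketch that assumes the group has already been loaded into a plain dict; the helper name `threshold_scale_factor` and its `phase` argument are hypothetical, and how the runtime consumes the resulting factor is not covered here.

```python
import math

# Values copied from the representative skip-softmax group in `config.json`.
group = {
    "algorithm": "skip_softmax",
    "threshold_scale_factor": {
        "formula": "a * exp(b * target_sparsity)",
        "prefill": {"a": 1443.49, "b": 4.30},
    },
    "target_sparsity": {"prefill": 0.5},
}


def threshold_scale_factor(group: dict, phase: str = "prefill") -> float:
    """Evaluate a * exp(b * target_sparsity) for one phase of a skip-softmax group."""
    coeffs = group["threshold_scale_factor"][phase]
    sparsity = group["target_sparsity"][phase]
    return coeffs["a"] * math.exp(coeffs["b"] * sparsity)


factor = threshold_scale_factor(group)
print(f"prefill scale factor at target sparsity 0.5: {factor:.2f}")
```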
### Testing
- `tests/examples/diffusers_sparsity/test_sparsity.py`: baseline /
triton-baseline / fixed-threshold runs of the Wan 2.2 example, plus a
Python-API calibrate → **export** test asserting the nested
`sparse_attention_config` (`threshold_scale_factor`, `target_sparsity`,
`ignore`, `initial_disabled_steps`) and the absence of any
`sparse.yaml`.
- `tests/unit/torch/sparsity/attention_sparsity/test_sparse_attention_conversion.py`
and `test_sparse_attn_config.py`: unit coverage of the per-group export
schema and the deploy-reader round-trip (writer nests → reader reads
from groups → internal mtsa config unchanged).
- Validated end-to-end on Wan 2.2 T2V-A14B: full 4-prompt / 40-step /
81-frame calibration; the exported checkpoint carries the nested schema
in both `transformer` and `transformer_2` `config.json`, and runtime
measurement shows ~47–49% tile sparsity at a 0.5 target.
### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ❌ The exported
`sparse_attention_config` schema was renamed and nested per-group during
0.45.x development, and the loader reads only the new layout —
checkpoints exported by earlier 0.45.x builds must be re-exported. No
released version is affected. <!--- If ❌, explain why. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ <!---
Mandatory -->
- Did you write any new necessary tests?: ✅ <!--- Mandatory for new
features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅ <!--- Only for new features, API changes, critical bug fixes or
backward incompatible changes. -->
### Additional Information
<!-- E.g. related issue. -->
---------
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?
**Type of change:** New example + new `modelopt.torch.fastgen` library
module.
Adds **DMD2 (Distribution Matching Distillation) for Qwen-Image** —
distilling the base model into a few-step (1–4) generator. Includes the
framework-agnostic `modelopt.torch.fastgen` loss library (DMD pipeline,
EMA, optional GAN discriminator) and a NeMo AutoModel–based training
example with a mock-data smoke config, a real-data config, and inference
/ export scripts.
**Note:** the example script will be migrated to the AutoModel repo.
### Usage
```bash
# Mock-data wiring smoke — runs end-to-end with no dataset to prepare
torchrun --nproc-per-node=8 \
examples/diffusers/fastgen/dmd2_finetune.py \
--config examples/diffusers/fastgen/configs/dmd2_qwen_image_smoke.yaml
```
See `examples/diffusers/fastgen/README.md` for real-data training and
inference.
### Testing
Unit tests under `tests/unit/torch/fastgen/`; `pre-commit` /
code-quality clean.
### Before your PR is "*Ready for review*"
- Backward compatible?: ✅ (new, additive module)
- Followed `CONTRIBUTING.md` for any copied code / new deps: ✅
- New tests added?: ✅
- Updated Changelog?: N/A
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
* Adds a FastGen-based distillation framework (DMD2) with
student/fake-score training, EMA support, GAN discriminator branch,
inference pipeline, and export utilities.
* Qwen-Image integration with latent packing and feature-capture for
plugin-enabled pipelines.
* **Documentation**
* New README, example configs, and runnable example scripts for
Qwen-Image distillation and inference.
* **Tests**
* Comprehensive unit tests covering math parity, gradient routing,
plugins, hooks, EMA, and recipe setup.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…only PTQ recipes (#1652)
### What does this PR do?
Type of change: new feature (recipes)
Several `general/ptq` recipe families shipped a data-driven FP8 KV-cache (`-kv_fp8`) variant but lacked the constant-amax `kv_fp8_cast` companion that `fp8_default` and `nvfp4_default` already have. This PR adds the missing cast variants so every KV-quantizing (and the weight-only) family offers the calibration-free FP8 KV-cache option:
- `general/ptq/nvfp4_experts_only-kv_fp8_cast`
- `general/ptq/nvfp4_mlp_only-kv_fp8_cast`
- `general/ptq/nvfp4_omlp_only-kv_fp8_cast`
- `general/ptq/nvfp4_weight_only-kv_fp8_cast`

Each new recipe composes the exact same model-quant config as its existing sibling and swaps the `kv_fp8` unit for the shared `kv_fp8_cast` unit (constant-amax FP8 KV cache; no KV calibration forward pass). The docs guide table/tree and the changelog are updated to match.
### Usage
```bash
python examples/llm_ptq/hf_ptq.py \
--pyt_ckpt_path <model> \
--recipe general/ptq/nvfp4_mlp_only-kv_fp8_cast
```
### Testing
Extended the built-in PTQ smoke test `tests/unit/recipe/test_loader.py::test_load_recipe_all_builtins` with the four new recipe paths; all four load into a valid `ModelOptPTQRecipe` with a populated `quantize` section.
```
$ python -m pytest tests/unit/recipe/test_loader.py tests/unit/recipe/test_presets.py -q
180 passed
```
`pre-commit` (including the `validate modelopt recipes` hook) passes on all changed files.
### Before your PR is "*Ready for review*"
- Is this change backward compatible?: ✅ (additive: only new recipe files)
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅ (extended the builtin recipe smoke test)
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅
- Did you get Claude approval on this PR?: ❌ (not yet)
### Additional Information
The two weight-only families were discussed for scope; `nvfp4_weight_only` is included (it already names a KV mode, `kv_fp16`), while `int4_blockwise_weight_only` is intentionally left untouched since it carries no `-kv_` composition.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit
* **New Features**
* Added four new NVFP4 PTQ (Post-Training Quantization) recipe variants: experts-only, MLP-only, OMLP-only, and weight-only configurations.
* All new recipes include FP8 KV-cache cast mode support for improved inference performance.
* **Documentation**
* Updated built-in recipes guide with new NVFP4 recipe options and repository layout.
* **Tests**
* Expanded recipe loader test coverage for new recipe configurations.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
)
### What does this PR do?
Type of change: CI / infrastructure (build-time speedup)
ModelOpt's CUDA quantization extensions (`modelopt_cuda_ext`, `_fp8`, `_mx`) JIT-compile via `torch.utils.cpp_extension.load()` on first use, taking ~110–140s **each** in a fresh container, which is the dominant cost of the `gpu_trtllm` job and the TRT-LLM example jobs. This caches them across runs.
The logic lives in a reusable composite action, **`.github/actions/cache-extensions`**, used by both `gpu_tests.yml` and `_example_tests_runner.yml`:
- Sets a **literal in-container `TORCH_EXTENSIONS_DIR`** (`/root/.cache/torch_extensions`). `${{ github.workspace }}` can't be used: for `container:` jobs it resolves to the *host* path, which is mounted elsewhere (`/__w`) inside the container, so torch and the cache step would disagree on the location.
- Caches that dir with `actions/cache`, keyed on a caller-supplied **env discriminator** (`rtxpro6000` + container image) plus a `hashFiles` of the kernel/loader sources, so the cache busts on any kernel change and is scoped per arch+image.
- On an **exact hit**, **backdates the kernel sources** below the cached objects so ninja reuses them. (Touching the *objects* instead desyncs ninja's `.ninja_deps`, which records each output's build-time mtime → `stored deps info out of date` → rebuild.)

Also fixes the unused `runner` default in `_example_tests_runner.yml` (`h100` → `rtxpro6000`) so it can't seed a wrong-arch cache.
### Usage
N/A: CI only. To reuse from another job:
```yaml
- uses: ./.github/actions/cache-extensions
  with:
    cache-key: rtxpro6000-${{ matrix.container_image }} # GPU arch + image
```
### Testing
Validated on `gpu_trtllm`: cache hit → `ninja: no work to do` → `test_cuda_ext*` dropped from **113s / 108s / 139s → 2.8s / 0.03s / 0.03s** (~360s saved per run). Jobs that build no extension (e.g. `gpu_vllm`) simply skip the save.
### Before your PR is "*Ready for review*"
- Is this change backward compatible?: ✅ (CI-only; key busts on source/image change)
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update Changelog?: N/A (CI infrastructure)
- Did you get Claude approval on this PR?: ❌ (pending)
### Additional Information
- Single-arch assumption: callers pass `rtxpro6000` in `cache-key`; if the runner fleet ever mixes GPU archs, update that prefix (the cache path is not arch-specific).
- No explicit TTL: the key is content-addressed, and GitHub auto-evicts caches unused for 7 days (+ 10 GB/repo LRU).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
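The backdating step in the extension-cache change above can be pictured with a small sketch. It is not the composite action's actual script: the function name `backdate_sources`, the `*.o` glob, and the one-second margin are assumptions made for illustration. It only shows the idea of moving source mtimes below the oldest cached object while leaving the objects themselves untouched.

```python
import os
import tempfile
from pathlib import Path


def backdate_sources(source_files: list[Path], cache_dir: Path, margin_s: float = 1.0) -> None:
    """Set each source's mtime just below the oldest cached object file."""
    objects = list(cache_dir.rglob("*.o"))
    if not objects:
        return  # nothing cached, so let the normal build run
    oldest_object = min(obj.stat().st_mtime for obj in objects)
    target = oldest_object - margin_s
    for src in source_files:
        # Only the sources are touched; the objects keep their build-time mtimes.
        os.utime(src, (target, target))


# Tiny demonstration with stand-in files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cache = root / "torch_extensions"
    cache.mkdir()
    obj = cache / "kernel.o"
    obj.write_bytes(b"")
    src = root / "kernel.cu"
    src.write_text("// kernel source")  # written after the object, so it looks newer
    backdate_sources([src], cache)
    source_is_older = src.stat().st_mtime < obj.stat().st_mtime

print(source_is_older)
```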
…r examples/megatron_bridge (#1601)
### What does this PR do?
Type of change: documentation (+ minor test fixes)
Migrates the Nemotron-3-Nano-30B-A3B-BF16 tutorial quantization step from `examples/llm_ptq/hf_ptq.py` to the Megatron-Bridge quantize + export, and relocates the tutorial next to the scripts it now uses. Now that the whole tutorial is Megatron-Bridge based, it lives under `examples/megatron_bridge/`.
- **Quantization migration:** replace the single `hf_ptq.py` call with `examples/megatron_bridge/quantize.py` (calibrate + save a Megatron checkpoint) → `examples/megatron_bridge/export.py` (deployable unified HF checkpoint). The FP8 results table is refreshed with the `quantize.py` numbers (same defaults, slightly better on average).
- **Relocation:** moved `examples/pruning/minitron/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/` → `examples/megatron_bridge/tutorials/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/`. A **redirect-stub `README.md`** remains at the old path (a directory symlink isn't traversable in the GitHub web UI), and all in-repo references (root README, CHANGELOG, pruning READMEs, megatron_bridge README) plus the tutorial's own relative links are updated.
- **Evaluation:** per-format vLLM benchmark commands (BF16 / FP8), FP8 deployment notes documented in `nemo_evaluator.yaml`, reduced LiveCodeBench/AIME `num_repeats` (were too slow), and bumped the `nemo-evaluator-launcher` pin.
- **Misc:** drop the `examples/megatron_bridge/requirements.txt` `transformers<5` pin in favor of an inline "downgrade `transformers<5` to save pruned Nemotron checkpoints" note; guard the hybrid Mamba-MoE sharded-state-dict test behind `HAS_MAMBA` (requires `mamba_ssm`); shrink the tiny Gemma3 test fixture's attention heads.

> **Note:** the **NVFP4 + QAD** experiments (formerly the focus of this PR) are split out, since their accuracy/throughput results are still in progress, and will follow in a separate PR on top of this one.
### Testing
Docs-only + test-guard changes. Pre-commit hooks (markdownlint, RST checks, ruff, mypy) pass. The tutorial's relative links and the old-path redirect stub were verified to resolve to real files.
### Before your PR is "*Ready for review*"
- Is this change backward compatible?: ✅ (old tutorial path still resolves via a redirect-stub README; `quantize.py`/`export.py` already exist in `examples/megatron_bridge`)
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A (adjusts/guards existing tests only)
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ (existing tutorial entry updated to the new path)
- Did you get Claude approval on this PR?: ✅
### Additional Information
Supersedes the previous "Part 3 of 4 (NVFP4 + QAD docs)" scope of this PR; the NVFP4 + QAD tutorial additions will land in a follow-up.
<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit
* **Documentation**
* Moved the Nemotron-3-Nano-30B-A3B tutorial into the Megatron-Bridge tutorials and replaced the old file with a pointer to the new location.
* Updated vLLM throughput numbers to 2.6× and expanded results/throughput tables.
* Reworked the FP8 quantization/export workflow and added a note to use transformers<5 when saving pruned models.
* Added a tutorials index and adjusted evaluator launcher pin and repeat counts.
* **Tests**
* Tests now detect optional Mamba support and skip related tests when unavailable.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
#1653)
### What does this PR do?
Type of change: new feature
Brings the GPT-OSS lossless MXFP4 → NVFP4 cast (#1372) to DeepSeek V4's routed-expert export by adding a `--cast_mxfp4_to_nvfp4` flag to `examples/deepseek/deepseek_v4/quantize_to_nvfp4.py`.
To avoid duplicating the closed-form math, the shared numerics (`mxfp4_to_nvfp4_global_amax`, `mxfp4_to_nvfp4_per_block_amax`, and the E2M1/E4M3/E8M0 constants) are **hoisted out of the GPT-OSS example cast into the library** at `modelopt/torch/quantization/utils/numeric_utils.py`. Both the GPT-OSS cast (`examples/llm_ptq/cast_mxfp4_to_nvfp4.py`) and the new DeepSeek path now import them from there.
DeepSeek V4's routed experts ship as MXFP4 (E2M1 nibbles + a power-of-two E8M0 scale per 32-element block). By default the export dequantizes them to BF16 and re-quantizes to NVFP4 using the calibrated per-tensor weight amax, which re-derives per-block scales from the data and is therefore lossy. With the flag, the cast pins `scale_2 = 2^(k_max-8)` and each per-block E4M3 scale to `2^(k_j-m)` straight from the source E8M0 scales, so `per_block_scale * scale_2 = 2^k_j` and the NVFP4 nibbles equal the source MXFP4 nibbles bit-for-bit (for every block whose `k_j` lands in E4M3's representable window; rare out-of-range blocks clamp). The one V4-specific addition is that w1/w3 share a single `scale_2` for the fused GEMM1, so `k_max` is taken over both projections.
The flag only affects routed-expert **weights**; activation `input_scale` still comes from `--amax_path` calibration.
### Usage
```bash
python deepseek_v4/quantize_to_nvfp4.py \
--amax_path ${AMAX} \
--source_ckpt ${DS_V4} \
--output_ckpt ${HF_NVFP4_PATH} \
--cast_mxfp4_to_nvfp4
```
### Testing
- The hoisted numerics get unit tests in `tests/unit/torch/quantization/test_numeric_utils.py` (10 cases: per-tensor global_amax, per-block amax incl. out-of-range, magnitude-table cache); 10/10 pass. The example test `tests/examples/llm_ptq/test_cast_mxfp4_to_nvfp4.py` keeps the cast-specific cases (quantizer naming, `build_amax_map`, `apply_to_model`).
- Validated on real DeepSeek-V4-Flash expert tensors (incl. the on-disk `float8_e8m0fnu` scale dtype): 23.5M blocks, 100% lossless, 0 error.
- Generated a full NVFP4 checkpoint for DeepSeek-V4-Flash (43 layers, 256 routed experts) end-to-end: `[cast] lossless MXFP4->NVFP4 blocks: 8,657,043,456/8,657,043,456 (100.0000%)`. Output weights match an independently-produced reference cast byte-for-byte (`weight_scale`, `weight_scale_2`, packed nibbles modulo the harmless sign-of-zero).
### Before your PR is "*Ready for review*"
- Is this change backward compatible?: ✅ (new opt-in flag; default export behavior unchanged; hoist re-exports through the existing example module)
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ N/A (no new deps; shared numerics moved into the library rather than duplicated)
- Did you write any new necessary tests?: ✅ (library numerics covered by `tests/unit/torch/quantization/test_numeric_utils.py`; end-to-end validated on a real DeepSeek-V4 checkpoint)
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅
- Did you get Claude approval on this PR?: ❌ (will run `/claude review`)
### Additional Information
Mirrors and reuses #1372 (GPT-OSS MXFP4 → NVFP4 cast); the closed-form numerics are now shared via `modelopt.torch.quantization.utils.numeric_utils`.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit
* **New Features**
* Added `--cast_mxfp4_to_nvfp4` flag to perform a closed-form, mostly lossless MXFP4→NVFP4 conversion for routed-expert weights with aggregated lossless/block statistics.
* **Documentation**
* Updated DeepSeek V4 export instructions and README to document the new flag and clarify calibration behavior for activation scales.
* **Chores**
* Exposed shared numeric quantization utilities for MXFP4→NVFP4 casting.
* **Tests**
* Added and updated tests to validate the new numeric helpers and conversion behavior.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
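The scale arithmetic in the MXFP4 → NVFP4 cast above can be checked with plain powers of two. The sketch below is not the library code in `numeric_utils.py`. It assumes `m = k_max - 8` (so that `scale_2 = 2^m`), and it assumes the E4M3 window for exact powers of two spans exponents -9 to 8; the function name `cast_scales` is hypothetical.

```python
import math

# Assumption: E4M3 represents exact powers of two from 2**-9 (smallest
# subnormal) up to 2**8 (largest power of two below the 448 maximum).
E4M3_MIN_EXP, E4M3_MAX_EXP = -9, 8


def cast_scales(block_exponents: list[int]) -> tuple[float, list[float], list[bool]]:
    """Derive scale_2 and per-block scales from E8M0 exponents k_j (block scale = 2**k_j)."""
    k_max = max(block_exponents)
    m = k_max - 8  # assumption: scale_2 = 2**m
    scale_2 = math.ldexp(1.0, m)
    per_block, lossless = [], []
    for k_j in block_exponents:
        exponent = k_j - m
        in_window = E4M3_MIN_EXP <= exponent <= E4M3_MAX_EXP
        clamped = min(max(exponent, E4M3_MIN_EXP), E4M3_MAX_EXP)
        per_block.append(math.ldexp(1.0, clamped))
        lossless.append(in_window)
    return scale_2, per_block, lossless


exponents = [-3, 0, 2, -20]  # the last block falls outside the assumed window
scale_2, per_block, lossless = cast_scales(exponents)
products = [scale * scale_2 for scale in per_block]
print(lossless)
```

For the in-window blocks, `per_block_scale * scale_2` reproduces `2^k_j` exactly, which is the property the description relies on. To mimic the shared `scale_2` for w1/w3, the exponents of both projections would be passed in one list so that `k_max` covers both.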
### What does this PR do?
Type of change: Bug fix
INT8 entropy calibration of fp16 ONNX models (e.g. ConvNext /
EfficientViT / YOLOv8 backbones quantized via `python -m
modelopt.onnx.quantization --quantize_mode=int8`) used to fail during
histogram collection with:
```
ValueError: Too many bins for data range. Cannot create 128 finite-sized bins.
```
`_collect_value` in `modelopt/onnx/quantization/ort_patching.py` derives
`threshold = max(abs(min), abs(max))` from the activation tensor and
passes `range=(-threshold, threshold)` to `np.histogram(...)`. When the
model is fp16 and a calibrated activation has a small range (≲ 1e-5),
both endpoints inherit fp16 dtype. Under numpy 2.0's NEP-50 strict
promotion, the resulting fp16 `linspace` collapses consecutive 128-bin
edges to the same value and numpy refuses to build the histogram. numpy
1.x silently used higher-precision intermediate dtype, masking the
issue.
The fix casts the range endpoints to Python `float` so numpy computes
bin edges in float64 regardless of input dtype. Applied at both call
sites: `_collect_value` and the single-node variant
`_collect_value_histogram_collector_single_node_calibration`.
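A minimal sketch of the pattern, not the actual `_collect_value` body: the function name `histogram_with_float_range` is hypothetical, and only the `float(...)` cast on the range endpoints mirrors the fix described above.

```python
import numpy as np


def histogram_with_float_range(values: np.ndarray, num_bins: int = 128):
    """Build a symmetric histogram with range endpoints cast to Python floats."""
    threshold = max(abs(values.min()), abs(values.max()))
    # Without this cast the endpoints keep the array's fp16 dtype.
    threshold = float(threshold)
    return np.histogram(values, bins=num_bins, range=(-threshold, threshold))


# fp16 activations: mostly zeros plus one small value, as in the regression test.
activations = np.zeros(1024, dtype=np.float16)
activations[0] = 1e-5
hist, edges = histogram_with_float_range(activations)
print(hist.sum(), len(edges))
```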
### Usage
```bash
# The affected workflow — INT8 entropy calibration of any fp16 ONNX model:
python -m modelopt.onnx.quantization \
--quantize_mode=int8 \
--onnx_path=model.fp16.onnx \
--calibration_data_path=calib.npy
```
No API change.
### Testing
- Added `test_collect_value_fp16_narrow_range` in
`tests/gpu/onnx/test_ort_patching.py` that calls `_collect_value` with a
fp16 tensor (mostly zeros + one ~1e-5 value) and asserts the histogram
is built without raising and all bin edges are distinct. Fails on the
buggy code, passes after the fix.
- Reproduced the original failure on numpy 2.2.6 before the fix.
- Full `tests/gpu/onnx/test_ort_patching.py` suite (31 tests) passes.
### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Bug Fixes**
* Fixed INT8 entropy calibration for fp16 ONNX models failing with NumPy
>= 2.0. Histogram range computation now correctly handles fp16
activations with small dynamic ranges.
* **Tests**
* Added test coverage for INT8 calibration with fp16 activations using
narrow value ranges.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
### What does this PR do?
Type of change: Bug fix
Fixes `apply_chat_template` failures when loading `nemotron-sft-agentic-v2` with Nemotron3 Nano tokenizer. HF agentic datasets store OpenAI-style `tool_calls` with `function.arguments` as JSON **strings**, but Nemotron v3 chat templates iterate `tool_call.arguments|items` in Jinja2, which requires a **mapping**. That mismatch raised:
```
TypeError: Can only get item pairs from a mapping.
```
This PR:
- Adds shared `prepare_messages_for_chat_template()` in `modelopt.torch.utils.dataset_utils` to normalize string tool-call arguments to dicts (including both nested `function.arguments` and top-level `arguments`).
- Routes `get_dataset_samples` / `get_dataset_dataloader` chat-template paths through the helper with `reasoning_content="native"` and `normalize_tool_calls=True`, preserving `reasoning_content` for tokenizers that handle it natively while fixing tool calls.
- Refactors `megatron_preprocess_data._process_messages` to delegate to the same helper (no behavior change: `strip`/`inline` still handle reasoning; `native` still returns messages unchanged without tool-call normalization).
- Consolidates tests: hermetic logic stays in unit tests; one live GPU integration test covers the v3 calibration path.
### Testing
- New e2e tests added to replace previous simpler tests
- Manual verification (Nemotron 3 Nano tokenizer + `nemotron-sft-agentic-v2`):
### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A
- Did you get Claude approval on this PR?: Not yet
### Additional Information
Root cause: Nemotron v3 Jinja chat templates use `tool_call.arguments|items`; OpenAI-format dataset rows store arguments as JSON strings.

Related prior art in-repo: `megatron_preprocess_data` already normalized tool-call arguments inline; this PR deduplicates that logic into `prepare_messages_for_chat_template`.
<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit
## Release Notes
* **New Features**
* Added public utilities for preparing OpenAI-style chat messages with reasoning content support, including native reasoning mode handling.
* Implemented automatic tool call argument normalization for consistent tokenizer operations.
* **Refactor**
* Consolidated chat template application across registered and auto-detected chat datasets using unified preprocessing.
* **Tests**
* Added unit and integration tests validating reasoning content preparation and chat template functionality.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
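The tool-call normalization described in the PR above can be illustrated with a minimal sketch. It assumes OpenAI-style message dicts; `normalize_tool_call_arguments` is a hypothetical name used only for illustration and does not reflect the actual signature or internals of `prepare_messages_for_chat_template()`.

```python
import copy
import json


def normalize_tool_call_arguments(messages):
    """Return a copy of `messages` with JSON-string tool-call arguments parsed to dicts.

    Hypothetical helper for illustration only. It covers both the nested
    `function.arguments` layout and a top-level `arguments` key.
    """
    normalized = copy.deepcopy(messages)
    for message in normalized:
        for tool_call in message.get("tool_calls") or []:
            # Nested layout: {"function": {"name": ..., "arguments": "..."}}
            # Flat layout:   {"name": ..., "arguments": "..."}
            for holder in (tool_call.get("function"), tool_call):
                if not isinstance(holder, dict):
                    continue
                arguments = holder.get("arguments")
                if not isinstance(arguments, str):
                    continue  # Already a mapping (or absent): nothing to do.
                try:
                    parsed = json.loads(arguments)
                except json.JSONDecodeError:
                    continue  # Leave malformed JSON untouched.
                if isinstance(parsed, dict):
                    holder["arguments"] = parsed
    return normalized


messages = [
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {
                "type": "function",
                "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
            },
            {"name": "get_time", "arguments": '{"tz": "UTC"}'},
        ],
    }
]
fixed = normalize_tool_call_arguments(messages)
print(fixed[0]["tool_calls"][0]["function"]["arguments"])
```

After normalization, a Jinja2 template that iterates `arguments|items` receives a mapping rather than a string; the input list itself is left unmodified because the sketch works on a deep copy.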
### What does this PR do?
Type of change: documentation
Adds two docs under `modelopt_recipes/` (no code or behavior changes):
- **`README.md`**: a catalog of the recipe library, covering its purpose
(a recipe is the single, version-controlled source of truth for *how* a
model is optimized), the directory layout (`general/`, `huggingface/`,
`models/`, `configs/`), how to load/select recipes (`load_recipe`,
`--recipe`), and a high-level map of the general PTQ combos,
speculative-decoding, and distillation recipes.
- **`recipe.md`**: a focused guide to the PTQ schemes, covering the
general `general/ptq/` body scopes (full-model FP8/NVFP4, scoped
experts-only / mlp-only / omlp-only, weight-only), KV-cache modes
(`kv_fp8_cast` / `kv_nvfp4_cast` / `kv_fp8`), calibration variants
(max / mse / gptq / layerwise), low- vs high-concurrency deployment
guidance, and the model-specific recipes under `huggingface/` and
`models/`, each compared to its general baseline.
### Usage
```python
# Documentation only. The recipes themselves load as before, e.g.:
from modelopt.recipe import load_recipe
cfg = load_recipe("general/ptq/nvfp4_experts_only-kv_fp8_cast")
```
### Testing
`pre-commit run --files modelopt_recipes/README.md modelopt_recipes/recipe.md`
passes (markdownlint, modelopt recipe validation, license/format hooks).
### Before your PR is "*Ready for review*"
- Is this change backward compatible?: N/A <!-- docs only -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A <!-- docs only -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A <!-- docs only -->
- Did you get Claude approval on this PR?: ❌ <!-- not yet -->
### Additional Information
Documentation for the `modelopt_recipes/` library; content verified
against the recipe YAMLs and the `modelopt.recipe` / config-loader source.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Documentation**
* Added comprehensive ModelOpt recipes guide describing YAML-based,
composable optimization workflows, directory/lookup layout, reuse via
imports, and how to add or share recipes.
* Added PTQ quantization guide covering recipe naming/structure,
quantization scopes and KV-cache options, calibration variant guidance,
model-specific overrides, multimodal considerations, and a
checkpoint-mirroring example.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…a 4 MTP (#1677)
### What does this PR do?
Type of change: Bug fix
Fixes the specdec_bench vLLM wrapper's MTP `speculative_config` emission so Gemma 4 MTP no longer hits the wrong code path inside vLLM.

vLLM's `SpeculativeConfig.__post_init__` (`vllm/config/speculative.py:529-602`) auto-detects `method` ONLY when it's unset. When `model` is provided and `method` is `None`, the default branch sets `method = "draft_model"`, which is the generic same-architecture draft path, NOT MTP. That path enforces equal num_heads between target and draft and raises:
```
AssertionError: All layers in one attention group must share num_heads; got {8, 4}
```
on heterogeneous-head models. Gemma 4 has 8 target heads and 4 draft heads by design.

PR #1663 changed the MTP branch in the wrapper to emit `{model: <assistant>, num_speculative_tokens: N}` WITHOUT `method` when `draft_model_dir` was provided, based on a misread of vLLM PR #41745's test plan that only showed the `{model, num_speculative_tokens}` shape. That test plan was the direct `LLM(...)` constructor invocation; vLLM had already defaulted method internally. Going through specdec_bench's `AsyncEngineArgs(speculative_config=...)` path, the explicit `method` key is required to avoid the auto-detect → draft_model fallback.

vLLM's own test at [`tests/v1/e2e/spec_decode/test_spec_decode.py:818-823`](https://github.com/vllm-project/vllm/blob/main/tests/v1/e2e/spec_decode/test_spec_decode.py#L818) does exactly this for the gemma4-e4b parametrization:
```python
speculative_config = {
    "method": method,  # "mtp"
    "num_speculative_tokens": ...,
}
if draft_model is not None:  # Gemma 4 case
    speculative_config["model"] = draft_model
```
Restore `method="mtp"` as the unconditional MTP path. ADD `model` only when `draft_model_dir` is set. Backward-compatible for Qwen 3.5 MTP / DeepSeek MTP / other inline-MTP families (they keep the bare `{method: "mtp"}` config).
### Testing
Field-tested via vLLM PR #41745's correctness test on `gemma-4-E4B-it` + `gemma-4-E4B-it-assistant`: produced 304.7 output TPS at γ=4 vs 171.0 baseline (1.78x speedup) on H100. This is the same `speculative_config` shape this fix emits.

[OMNIML-5024](https://jirasw.nvidia.com/browse/OMNIML-5024) pipeline:
- Wrapper emitted `{model: assistant, num_speculative_tokens: 3}`
- vLLM auto-detected `method = "draft_model"`
- Loaded gemma-4-E4B-it-assistant (4 heads) as a generic draft for gemma-4-E4B-it (8 heads)
- Attention-group num_heads check tripped → AssertionError, task_0 FAILED, task_1 CANCELLED
### Before your PR is "*Ready for review*"
- Backward compatible: ✅ (Qwen 3.5 / DeepSeek MTP unchanged; only the MTP+`draft_model_dir` case changes).
- New tests: ❌. The test exercising this codepath would need a GPU + gemma-4 model checkout, which is cluster work, not unit-test scope. JIRA-tracked validation via OMNIML-5024 dispatch after this lands.
- Changelog: ❌
### Additional Information
- vLLM PR #41745 (Gemma4 MTP support)
- Companion: NVIDIA/Model-Optimizer PR #1675 (launcher `GlobalVariables.draft_model` schema fix)
<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit
* **Bug Fixes**
* Fixed speculative decoding configuration handling in the benchmark example to ensure consistent method assignment and proper draft model configuration.
* **Documentation**
* Updated configuration comments to reflect corrected behavior and improved clarity.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
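The restored behavior of the MTP PR above (always set `method`, add `model` only when a draft directory is given) can be sketched as a small pure function. `build_mtp_speculative_config` is a hypothetical name for illustration; the wrapper's real function and argument names are not shown in this description.

```python
def build_mtp_speculative_config(num_speculative_tokens, draft_model_dir=None):
    """Build an MTP `speculative_config` dict with `method` always set explicitly.

    Hypothetical helper for illustration; only the resulting dict shape is
    taken from the PR description.
    """
    config = {
        "method": "mtp",  # Explicit, so no auto-detect fallback to "draft_model".
        "num_speculative_tokens": num_speculative_tokens,
    }
    if draft_model_dir is not None:
        # Separate assistant checkpoint (the Gemma 4 case).
        config["model"] = draft_model_dir
    return config


inline_mtp = build_mtp_speculative_config(3)
gemma_mtp = build_mtp_speculative_config(3, draft_model_dir="gemma-4-E4B-it-assistant")
print(inline_mtp)
print(gemma_mtp)
```

Keeping `method` unconditional means inline-MTP families and the separate-assistant case share one code path and differ only in the optional `model` key.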
### What does this PR do?
Type of change: New feature (PTQ recipe)
Adds a declarative YAML recipe for post-training quantization of **Nemotron-H** hybrid models (Mamba-2 + MLP + Attention) under the `modelopt_recipes` framework. The recipe is mixed-precision and composed **entirely from existing recipe units**; no core-library or `hf_ptq.py` changes are required. (Quantized `nn.Embedding` support, which the embedding line relies on, already landed in #1495.) Precision mirrors the **GGUF Q4_K_M** bit allocation of the same model, mapped onto NVFP4/FP8.

`modelopt_recipes/models/Nemotron-H/nvfp4_w4a16.yaml`:

| Precision | GGUF source | Modules |
| --- | --- | --- |
| NVFP4 W4A4 | Q4_K / Q5_0 | in_proj, out_proj, up_proj, attn q/k/v/o_proj, down_proj (Q4_K layers 13,15,20,22,27,29,37,39) |
| FP8 W8A8 | Q6_K | MLP down_proj (layers 1,3,5,8,10,18,25,33,41) |
| NVFP4 W4A16 (weight-only) | n/a | input embedding |
| FP8 W8A16 (weight-only) | n/a | lm_head |
| bf16 | F32 | Mamba conv1d, all norms, A_log / D / dt_bias |

The Q8_0 attn `v_proj` layers (24, 32) are kept **NVFP4 W4A4** rather than FP8: ModelOpt's export fuses q/k/v (they share the attention input) and requires one format across the group, so `v` can't diverge from `q`/`k`.

Built from the units `base_disable_all`, `w4a4_nvfp4_nvfp4`, `default_disabled_quantizers`, `configs/numerics/fp8`, and `configs/numerics/nvfp4`.
### Usage
```bash
python examples/llm_ptq/hf_ptq.py \
--pyt_ckpt_path nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 \
--recipe models/Nemotron-H/Nemotron-3-Nano-4B/nvfp4_w4a16 \
--trust_remote_code \
--export_path nemotron-3-nano-4b-nvfp4
```
### Testing
- `pre-commit run --files modelopt_recipes/models/Nemotron-H/nvfp4_w4a16.yaml` passes, including the `validate modelopt recipes` schema hook.
- End-to-end PTQ + unified HF export on `nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16` (calibration: `cnn_dailymail`, 512 samples, seq 512). Produced a 2.57 GB unified HF checkpoint; `hf_quant_config.json` was verified per-layer against the table above: 9 FP8 W8A8 `down_proj` + FP8 weight-only `lm_head`, NVFP4 W4A16 embedding, NVFP4 W4A4 everywhere else, with q/k/v/o uniform within each attention layer (required for export fusion).
### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors).
- Is this change backward compatible?: ✅ (purely additive; a new opt-in recipe file)
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A (declarative config; covered by the `validate modelopt recipes` pre-commit hook)
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A (new recipe config, not a library API change)
- Did you get Claude approval on this PR?: ❌ (pending `/claude review`)
### Additional Information
Depends on #1495 (quantized `nn.Embedding` support) for the embedding line to pack on export.

Possible follow-ups (out of scope here):
- A compressed-tensors conversion pass so the checkpoint is consumable by vLLM (`*.weight → *.weight_packed`, `*.weight_scale_2 → *.weight_global_scale`, and a `format: nvfp4-pack-quantized` / `quant_method: compressed-tensors` quantization config).
- A `--vllm-compat`-style variant that additionally excludes Mamba `in_proj` (output dim `17504 = intermediate + conv_dim + num_heads` is not divisible by 64, violating Marlin repack alignment) and leaves `lm_head` / embedding in bf16, for out-of-the-box vLLM consumption.
---------
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
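The q/k/v fusion constraint in the Nemotron-H recipe PR above (one format per attention layer) lends itself to a mechanical check. A minimal sketch, assuming a hypothetical flat mapping from module name to format string; this is not the actual `hf_quant_config.json` schema, and the module names are made up for illustration.

```python
from collections import defaultdict


def find_mixed_qkv_layers(module_formats):
    """Return attention prefixes whose q/k/v projections do not share one format."""
    formats_by_layer = defaultdict(set)
    for name, fmt in module_formats.items():
        prefix, _, leaf = name.rpartition(".")
        if leaf in ("q_proj", "k_proj", "v_proj"):
            formats_by_layer[prefix].add(fmt)
    return sorted(prefix for prefix, fmts in formats_by_layer.items() if len(fmts) > 1)


# Hypothetical module names and format labels.
module_formats = {
    "layers.24.attn.q_proj": "NVFP4",
    "layers.24.attn.k_proj": "NVFP4",
    "layers.24.attn.v_proj": "FP8",  # Diverges from q/k: cannot be fused.
    "layers.32.attn.q_proj": "NVFP4",
    "layers.32.attn.k_proj": "NVFP4",
    "layers.32.attn.v_proj": "NVFP4",
    "layers.1.mlp.down_proj": "FP8",  # Not part of a q/k/v group; ignored.
}
mixed = find_mixed_qkv_layers(module_formats)
print(mixed)
```

An empty result corresponds to the "q/k/v uniform within each attention layer" condition that the testing notes report.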
…en (#1673)
### What does this PR do?
Type of change: Bug fix
Fixes the generation **preview** in `examples/llm_ptq/hf_ptq.py` producing garbage output (e.g. repeated `\u200b` zero-width-space tokens) for models whose tokenizer has `pad_token == eos_token`, most visibly GLM-5.1. The garbage appeared *before* quantization, so it was not a quantization issue.

**Root cause:** `pre_quantize` / `post_quantize` take the first (left-padded) calibration sample and call `full_model.generate(preview_input_ids, ...)` **without an `attention_mask`**. HuggingFace only auto-infers the mask when `pad_token_id != eos_token_id` (`generation/utils.py:_prepare_attention_mask_for_generation`); when they are equal it falls back to an all-ones mask, so the model attends to the leading pad/eos tokens, ignores the real prompt, and (for GLM's MoE/DSA/MTP path) collapses to a single repeated token. Calibration itself was always correct because it already passes the mask; only the preview generation was missing it.

**Fix:** thread the calibration batch's `attention_mask` through to both preview `generate()` calls. One file changed (`examples/llm_ptq/hf_ptq.py`, +20/-8).
### Usage
No usage change: the same command now produces a coherent preview instead of `\u200b` repetition.
### Testing
Reproduced the exact mechanism (left padding + pad_token == eos_token + missing attention_mask) on a small model (GPT2): without the mask the model emits the same HF warning as the bug report and ignores the prompt; with the mask the output is byte-identical to the unpadded baseline.

Verified no behavioral change for models where pad != eos (the explicit mask equals HF's inferred input_ids.ne(pad_id)) and for Whisper (its batch carries no attention_mask, so the path is unchanged).

Pre-commit: ruff-check, ruff-format, and mypy (no new errors vs. main) all pass.
### Before your PR is "*Ready for review*"
Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).
Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).
- Is this change backward compatible?: ✅ <!-- Only changes internal helper signatures within the example script; no public API affected. -->
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A <!-- No copied code, no new dependency. -->
- Did you write any new necessary tests?: N/A <!-- Preview path requires model loading; no existing unit-test harness covers it. Verified via a standalone repro of the root-cause mechanism. -->
- Did you update Changelog?: N/A <!-- Bug fix confined to an example-script preview; not a library/API change. Happy to add a 0.46 bug-fix entry if preferred. -->
- Did you get Claude approval on this PR?: ✅ <!-- Will run `/claude review` before requesting review. -->
### Additional Information
Backward compatible across model families:

| Model class | Before (no mask passed) | After (mask passed) | Result |
|---|---|---|---|
| `pad != eos` (most: T5, BART, many LLMs) | HF infers mask = `input_ids.ne(pad_id)` | explicit calib mask = same tensor | **Identical output**, no change |
| `pad == eos` (GLM-5.1, GPT-2-style) | all-ones fallback → attends to pad → garbage | correct mask | **Fixed** |
| Whisper | no mask | batch has no `attention_mask` key → `None` → no mask | **Identical**, no change |
| Nemotron-VL / DeepSeek / NemotronH / `--skip_generate` | `generate()` not called on this path | unchanged | No change |

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit
## Release Notes
* **Bug Fixes**
* Enhanced LLM post-quantization example to properly handle attention masks during preview generation. The quantization preview now correctly threads attention masks through generate() calls, ensuring accurate generation outputs are captured both before and after quantization steps.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
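The mask-inference fallback described in the preview-generation PR above can be reproduced without loading a model. The sketch below is a simplified plain-Python model of the behavior the PR describes, not HuggingFace's actual implementation; token IDs and function names are hypothetical.

```python
def inferred_attention_mask(input_ids, pad_token_id, eos_token_id):
    """Simplified model of mask inference when no `attention_mask` is passed.

    The mask is derived from padding only if a pad token is present AND it
    differs from the EOS token; otherwise every position is attended to.
    """
    pad_in_inputs = pad_token_id is not None and pad_token_id in input_ids
    pad_differs_from_eos = pad_token_id != eos_token_id
    if pad_in_inputs and pad_differs_from_eos:
        return [int(token != pad_token_id) for token in input_ids]
    return [1] * len(input_ids)


def left_padding_mask(num_pad, num_real):
    """Explicit mask for a left-padded sample, as a calibration batch would carry."""
    return [0] * num_pad + [1] * num_real


# Three pad tokens (ID 0) followed by a three-token prompt.
preview_input_ids = [0, 0, 0, 11, 12, 13]
explicit_mask = left_padding_mask(num_pad=3, num_real=3)

# pad == eos: the fallback attends to the leading pads (the bug).
mask_pad_equals_eos = inferred_attention_mask(preview_input_ids, pad_token_id=0, eos_token_id=0)
# pad != eos: inference already matches the explicit mask (no behavior change).
mask_pad_differs = inferred_attention_mask(preview_input_ids, pad_token_id=0, eos_token_id=2)

print(mask_pad_equals_eos, mask_pad_differs, explicit_mask)
```

Passing the explicit mask makes the two cases agree, which is why the first and second rows of the compatibility table differ only in the "Before" column.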
)
### What does this PR do?
Type of change: Bug fix
Fix `--quant_cfg` CLI parsing by typing `quant_cfg` as `str | None` instead of `str | QuantizeConfig | None`
### Testing
```
accelerate launch --config_file examples/gpt-oss/configs/zero3.yaml examples/gpt-oss/sft.py --config examples/gpt-oss/configs/sft_full.yaml --model_name_or_path openai/gpt-oss-20b --quant_cfg MXFP4_MLP_WEIGHT_ONLY_CFG --output_dir gpt-oss-20b-qa
```
### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A
- Did you get Claude approval on this PR?: ✅ / ❌ / N/A
<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit
* **Refactor**
* Quantization config parameter now accepts string identifiers or none; resolution behavior for named presets remains unchanged.
* **Documentation**
* Updated argument reference to reflect the parameter type change while preserving the deprecation note and usage guidance.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…1687)
### What does this PR do?
Type of change: Bug fix
Exclude Qwen visual and vision_tower modules from NVFP4 quantization and keep the Qwen linear attention projection exclusions. These modules can produce matrix dimensions that are incompatible with vLLM 0.22.1's ModelOpt FP4 Marlin fallback path, causing checkpoint load or profiling failures such as `size_n = 4304 is not divisible by tile_n_size = 64`.
### Usage
N/A. This is a recipe configuration fix.
### Testing
- `python -m pytest tests/unit/recipe/test_presets.py tests/unit/recipe/test_loader.py -q`
- `python -m pre_commit run --files modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml tests/unit/recipe/test_loader.py tests/unit/recipe/test_presets.py`
- E2E validation with `vllm/vllm-openai:v0.22.1`: PTQ export validation passed with zero Marlin-incompatible quantized layers, and vLLM `/health`, `/v1/models`, and `/v1/completions` passed. The final PR broadens the validated visual MLP exclusions to the full `*visual*` subtree and adds the common `*vision_tower*` naming pattern.
### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: Yes
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: Yes
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A
- Did you get Claude approval on this PR?: N/A
### Additional Information
N/A
<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit
* **Tests**
* Added unit tests that verify the built-in PTQ recipe and preset correctly disable incompatible projection and visual components for certain quantization modes.
* Ensures quantization settings are validated across recipes and presets.
* **Chores**
* Updated quantization configuration to disable quantizers for select projection and vision-related model layers.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
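The divisibility failure quoted in the Qwen exclusion PR above (`size_n = 4304`, `tile_n_size = 64`) can be screened for ahead of deployment. A minimal sketch: the tile size and the 4304 figure come from the quoted error message, while the layer names, the 4096 example, and the function itself are hypothetical.

```python
def find_tile_misaligned_layers(layer_output_dims, tile_n_size=64):
    """Return {layer_name: size_n} for layers whose output dim is not tile-aligned."""
    return {
        name: size_n
        for name, size_n in layer_output_dims.items()
        if size_n % tile_n_size != 0
    }


# Hypothetical layer names; 4304 is the size reported in the error message.
layer_output_dims = {
    "visual.blocks.0.mlp.fc1": 4304,
    "language_model.layers.0.mlp.up_proj": 4096,
}
misaligned = find_tile_misaligned_layers(layer_output_dims)
print(misaligned)
```

A layer flagged by a check like this is a candidate for exclusion from quantization, which is the approach the recipe fix takes for the visual subtree.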
…295242) (#1678)
### What does this PR do?
Type of change: Bug fix
Fixes the GPT-OSS MXFP4 → NVFP4 PTQ path (`examples/llm_ptq/hf_ptq.py` with `--cast_mxfp4_to_nvfp4`), which failed in three independent ways. The documented command now runs end-to-end and produces a bit-exact (100% lossless) NVFP4 checkpoint. Addresses **nvbug 6295279** (OMNIML-5046) and **nvbug 6295242** (OMNIML-5045).
1. **nvbug 6295242: CUDA illegal memory access on load.** GPT-OSS ships native MXFP4 weights that Transformers dequantizes to BF16; the threaded weight loader trips an illegal-memory access when `device_map="auto"` shards the dequant across **multiple GPUs**. The missing optional `kernels` package only *forces* the dequant path; it is not the root cause. `get_model` now detects MXFP4 checkpoints and loads them with `Mxfp4Config(dequantize=True)` on a **sequential** device map so the dequant stays on a single device. `kernels` is no longer required.
2. **nvbug 6295279 #1: `NotImplementedError: Mxfp4GptOssExperts` during unified HF export.** Forcing `dequantize=True` yields plain `GptOssExperts` (even when `kernels` is installed), which ModelOpt wraps and exports normally.
3. **nvbug 6295279 #2: `FileNotFoundError` in the cast step.** `--cast_mxfp4_to_nvfp4` treated `--pyt_ckpt_path` as a local dir; a HF Hub ID now resolves to its cached snapshot dir via `_resolve_model_path`.

Also fixes a **static-block NVFP4 regression** (surfaced by the cast's `force_weight_quantizers_static`, introduced by #1560's now-unconditional `weight_only_quantize`): `_QuantGptOssExperts` / `_QuantLlama4TextExperts` quantize their expert weights transposed in the forward (`_transposed_quantize`), but the inherited `iter_weights_for_calibration` fed the non-transposed weight, locking a mismatched block-quant `_original_shape` and raising `ValueError: Input shape has changed`. The override now calibrates on the transposed view, matching both the forward and the export's `_amax` orientation.

`get_model` never had explicit handling for a *natively pre-quantized MXFP4* checkpoint. GPT-OSS fell through the generic *unquantized-checkpoint* branch and relied on Transformers' **implicit** MXFP4 behavior, which is fragile across three axes. The cast was originally validated (#1372, 2026-05-01) in the "lucky" quadrant of each:
- **GPU count:** `device_map="auto"` on a single GPU never shards, so the dequant stays on one device. On multiple GPUs `auto` balances the model and shards the MXFP4→BF16 dequant across devices → CUDA illegal-memory crash (6295242).
- **`kernels` presence:** without `kernels`, Transformers auto-dequantizes to BF16 `GptOssExperts` (exportable). With `kernels` installed it keeps the packed `Mxfp4GptOssExperts` kernel path → export `NotImplementedError` (6295279 #1).
- **Transformers version:** the kernel-backed experts wrapper and the threaded multi-GPU weight loader are newer-Transformers behavior (env here is 5.5.4). Earlier versions simply dequantized MXFP4 → BF16, which is what the old generic path happened to need.

The QA env sat in the *breaking* quadrant (multi-GPU and/or `kernels` present, newer Transformers), so the implicit path failed. The new branch makes both decisions explicit and deterministic (`dequantize=True` + single-device load), regardless of environment, mirroring the existing `has_pack_quantized_config` branch for compressed-tensors checkpoints.

The fourth issue (static-block `Input shape has changed`) is a separate regression: it was introduced by **#1560 (2026-06-02, "Make sure all weight quantizers have `_amax`")**, a month *after* the cast landed. Previously, `weight_only_quantize` ran only when no calibration `forward_loop` was supplied, and the cast always supplies one, so the non-transposed weight-quantizer call simply never happened before. The conflict only appears at the intersection of (a) transposed-quantize experts (GPT-OSS/Llama4), (b) static-block NVFP4, which `--cast_mxfp4_to_nvfp4` forces via `force_weight_quantizers_static`, and (c) #1560. CI's GPT-OSS NVFP4 coverage uses the *dynamic*-block path, which never locks the block shape, so #1560 looked safe.
### Usage
```bash
python hf_ptq.py \
--pyt_ckpt_path openai/gpt-oss-20b \
--qformat nvfp4_mlp_only \
--cast_mxfp4_to_nvfp4 \
--export_path ./gpt-oss-20b-nvfp4
```
### Testing
- Ran the documented command end-to-end on 2xB200 (`openai/gpt-oss-20b`): cast overrode **48/48** expert weight quantizers, **100% lossless** layers/blocks, exported a valid packed-NVFP4 HF checkpoint (uint8 weights + FP8 per-block `weight_scale` + per-tensor `weight_scale_2` + `hf_quant_config.json`).
- Verified plain `--qformat nvfp4_mlp_only` (no cast) still works end-to-end.
- **Independently verified the export is bit-exact:** dequantized the exported NVFP4 weights (ModelOpt's E2M1 LUT + pack layout) and compared against Transformers' canonical MXFP4→BF16 dequant (`Mxfp4Config(dequantize=True)`) over all 24 layers × both expert weights: `max_abs_err = 0`, 100% bitwise-equal in bf16. So `dequant(exported NVFP4) == dequant(original MXFP4)` exactly.
- New unit tests: `test_get_original_hf_quant_method_*` (load detection) and `test_gpt_oss_experts_iter_weights_for_calibration_transposed` (the transpose regression). Existing `test_cast_mxfp4_to_nvfp4.py` (8 tests) still pass. `pre-commit` clean.

**Known limitation:** verified for gpt-oss-20b (fits one GPU). gpt-oss-120b dequantized does not fit a single GPU, so `sequential` would still span GPUs; that case would need a CPU-dequant-then-dispatch path and is left as a follow-up.
### Before your PR is "*Ready for review*"
- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ (0.45 Bug Fixes)
- Did you get Claude approval on this PR?: ❌ (not yet run)
### Additional Information
nvbug 6295279, nvbug 6295242 / OMNIML-5046, OMNIML-5045.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit
* **Bug Fixes**
* Prevented CUDA illegal-memory access during MXFP4→NVFP4 casting.
* Fixed expert-weight calibration orientation to avoid shape mismatches.
* **New Features**
* Support loading native MXFP4 checkpoints with automatic dequantization.
* Resolve remote model identifiers to local checkpoints when casting MXFP4→NVFP4, improving reliability.
* **Tests**
* Added unit and GPU regression tests covering quant-method detection, casting, and expert-weight calibration.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
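The static-block shape regression in the GPT-OSS PR above comes down to calibration and forward disagreeing about weight orientation. The toy sketch below shows only that mechanism: `ShapeLockedQuantizer` is a stand-in written for this example, not ModelOpt's quantizer, and the weight shape is hypothetical.

```python
class ShapeLockedQuantizer:
    """Toy stand-in: remembers the first shape it sees and rejects any other."""

    def __init__(self):
        self.original_shape = None

    def __call__(self, shape):
        if self.original_shape is None:
            self.original_shape = shape  # First call locks the block layout.
        elif shape != self.original_shape:
            raise ValueError("Input shape has changed")
        return shape


def transposed(shape):
    """Swap the last two dimensions of a shape tuple."""
    return shape[:-2] + (shape[-1], shape[-2])


weight_shape = (32, 2880, 5760)  # Hypothetical (experts, rows, cols).

# Regression: calibrate on the stored orientation, run forward on the transpose.
buggy = ShapeLockedQuantizer()
buggy(weight_shape)
try:
    buggy(transposed(weight_shape))
    buggy_error = None
except ValueError as exc:
    buggy_error = str(exc)

# Fix: calibrate on the same transposed view that the forward pass uses.
fixed = ShapeLockedQuantizer()
fixed(transposed(weight_shape))
fixed_result = fixed(transposed(weight_shape))

print(buggy_error, fixed_result)
```

A dynamic-block path that never locks a shape would not hit the `ValueError`, which is consistent with the PR's note about why existing CI coverage did not catch the regression.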
…6293731, 6293762) (#1691)

### What does this PR do?

Type of change: Bug fix

Fixes two sglang deployment failures on multimodal Gemma (`gemma-4-31B-it`) caused by general PTQ presets leaking quantization into the SigLIP vision branch via broad wildcards:

- **NVBug 6293731** (`general/ptq/fp8_default-kv_fp8`): the `w8a8_fp8_fp8` unit enables bare `*weight_quantizer` / `*input_quantizer`, which also match the vision tower (`model.vision_tower.*`, `model.visual.*`) and the vision embedding projection (`model.embed_vision.*`). The exported checkpoint deploys but emits **garbled text** in sglang.
- **NVBug 6293762** (`general/ptq/nvfp4_mlp_only-kv_fp8`): the `*mlp*` enables also match the vision tower's block MLPs (`model.vision_tower.encoder.layers.*.mlp`), and an image request **crashes** the FP4 kernel at decode: `ValueError: too many values to unpack (expected 2)` in sglang's `modelopt_quant.py` `apply`.

### Fix

Add `*embed_vision*` / `*vision_tower*` / `*visual*` disable rules to the shared `configs/ptq/units/default_disabled_quantizers` unit, alongside the existing `*router*` / `*lm_head*` entries. Both the composed `general/ptq/*` recipes **and** the `configs/ptq/presets/model/*` presets import this unit, so:

- every general recipe (`fp8_default`, `nvfp4_default`, `nvfp4_mlp_only`, `nvfp4_omlp_only`, …) keeps the vision branch in BF16 by default, fixing the whole vision-overreach class, not just the two reported recipes;
- the `test_general_ptq_yaml_matches_config_dicts` YAML↔preset parity test stays satisfied (both sides pick up the new entries from the one shared unit).

The rules are **no-ops on text-only models** (nothing matches). A recipe that intentionally wants to quantize the vision branch can re-enable these after importing the unit.

Files changed:

- `modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml` (+14)

### Testing

Re-export of `gemma-4-31B-it` with the affected recipes and re-deploy in sglang (the env from the bug reports: `lmsysorg/sglang:v0.5.12.post1`, GB200) to confirm fp8_default no longer garbles text and nvfp4_mlp_only no longer crashes on image requests. _(Results to be appended.)_

Unit-level: `tests/unit/recipe/test_loader.py::test_general_ptq_yaml_matches_config_dicts` (parity) passes for all four general presets.

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅ (text-only checkpoints unaffected; new rules only match vision modules that should never have been quantized by a general recipe)
- If you copied code from any other sources or added a new PIP dependency: N/A
- Did you write any new necessary tests?: N/A (recipe data fix; covered by the existing parity test + verified by real PTQ export + sglang deploy)
- Did you update Changelog?: N/A
- Did you get Claude approval on this PR?: ❌ (pending)

### Additional Information

NVBug 6293731 and 6293762. Reported on modelopt 0.45.0rc0, GB200, gemma-4-31B-it, sglang 0.5.12.post1. Tracked under OMNIML-5034. Companion to PR #1690 (same vision-overreach class on the gemma-specific `w4a8_awq` recipe, NVBug 6294017).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Chores**
  * Updated quantization configuration to preserve BF16 precision for vision encoder components in multimodal models.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
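The wildcard overreach described in #1691 can be reproduced in isolation. The sketch below is a minimal illustration that assumes glob-style matching as provided by Python's `fnmatch`; the full quantizer names and the "a disable match overrides an enable match" rule are assumptions made for this example, not the library's actual resolution logic.

```python
from fnmatch import fnmatch

# Hypothetical quantizer names; only the module prefixes come from the PR text.
NAMES = [
    "model.language_model.layers.0.mlp.down_proj.weight_quantizer",
    "model.vision_tower.encoder.layers.0.mlp.fc1.weight_quantizer",
    "model.embed_vision.embedding_projection.input_quantizer",
]

ENABLE = ["*mlp*"]  # broad enable rule from an MLP-only recipe
DISABLE = ["*embed_vision*", "*vision_tower*", "*visual*"]  # rules added by the fix


def is_quantized(name, enable, disable):
    """Assumed semantics for this sketch: a disable match overrides any enable match."""
    if any(fnmatch(name, pattern) for pattern in disable):
        return False
    return any(fnmatch(name, pattern) for pattern in enable)


before = {name: is_quantized(name, ENABLE, []) for name in NAMES}
after = {name: is_quantized(name, ENABLE, DISABLE) for name in NAMES}

for name in NAMES:
    print(f"{name}: before={before[name]} after={after[name]}")
```

With these inputs, the vision-tower MLP is matched by `*mlp*` before the disable rules are added and excluded afterwards, while the language-model MLP is unaffected.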
…fo in clear_stale_value_info (#1697)

### What does this PR do?

Type of change: Bug fix

INT4 quantization upgrades the model to opset >= 21, at which point ONNX Runtime runs type inference while building the AWQ calibration `InferenceSession`. Custom ops backed by TensorRT plugins (domain `trt.plugins`) have no ORT type-inference function, so their output types are only known from the `value_info` that TensorRT type/shape inference populated earlier in preprocessing. `clear_stale_value_info` cleared `value_info` wholesale, dropping those types, so ORT failed output type inference for the custom op at model load, e.g.:

```
Node (Conv-2) Op (IdentityConv) output arg (X2) type inference failed
```

- `modelopt/onnx/utils.py`: in `clear_stale_value_info`, preserve `value_info` entries for outputs of `trt.plugins`-domain nodes (which ORT cannot re-derive); clear the rest as before.
- `tests/gpu/onnx/quantization/test_plugin.py`: add a regression test quantizing a model with the built-in `CustomSkipLayerNormPluginDynamic` plugin at INT4 + awq_clip (the opset >= 21 path), asserting the quantized model is produced and the custom op survives.

### Usage

```bash
python -m modelopt.onnx.quantization \
    --onnx_path=model.onnx \
    --quantize_mode=int4 \
    --calibration_method=awq_clip \
    --trt_plugins=/path/to/plugin.so
```

### Testing

- `pytest tests/gpu/onnx/quantization/test_plugin.py -k int4_awq`: fails before the fix (ORT type-inference error at calibration-session load) and passes after. The full `test_plugin.py` (including the existing INT8 quantization and autocast cases) passes.
- The example [here](https://github.com/NVIDIA/Model-Optimizer/blob/main/examples/onnx_ptq/README.md#quantize-an-onnx-model-with-custom-op) also failed before this fix and now passes.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A

### Additional info

Fixes a regression introduced by #1565.

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Bug Fixes**
  * Preserve metadata for TensorRT plugin outputs during cleanup and correctly reconcile output data types so custom plugin ops remain intact after optimization/quantization.
* **Tests**
  * Added a GPU ONNX regression test covering int4 quantization with AWQ calibration to ensure TensorRT plugins are retained.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Gwenaelle Cunha Sergio <gcunhasergio@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
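The selective clearing described in #1697 can be pictured with a toy model of a graph. This sketch deliberately does not use the `onnx` package: `Node`, the dictionary-based `value_info`, and the `Conv-1`/`X1` entries are stand-ins invented for illustration, and the real `clear_stale_value_info` in `modelopt/onnx/utils.py` may differ.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Toy stand-in for a graph node (not the onnx NodeProto type)."""
    name: str
    op_type: str
    domain: str
    outputs: tuple


def clear_stale_value_info(value_info, nodes, keep_domain="trt.plugins"):
    """Drop inferred types except for outputs of nodes in `keep_domain`."""
    keep = {out for node in nodes if node.domain == keep_domain for out in node.outputs}
    return {name: dtype for name, dtype in value_info.items() if name in keep}


nodes = [
    Node("Conv-1", "Conv", "", ("X1",)),                       # standard op: type can be re-derived
    Node("Conv-2", "IdentityConv", "trt.plugins", ("X2",)),    # plugin op: type cannot be re-derived
]
value_info = {"X1": "float32", "X2": "float32"}

kept = clear_stale_value_info(value_info, nodes)
print(kept)
```

Only the plugin output `X2` keeps its recorded type; the standard op's entry is cleared so that it can be inferred again.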
…#1702)

### What does this PR do?

Type of change: Bug fix

Fixes nvbug **6311147** (OMNIML-5103).

`examples/deepseek/deepseek_v3/ptq.py` resolved the cloned DeepSeek-V3 / DeepSeek-V3.2-Exp inference repos relative to its own directory (`deepseek_v3/`) via `Path(__file__).resolve().parent`. But the [README](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/deepseek) clones those repos into the parent `examples/deepseek/` directory and runs the script from there, so the lookup landed one level too deep and raised `ValueError: DeepSeek-V3 or DeepSeek-V3.2-Exp not found` (the error message also printed the wrong directory).

The fix resolves from `parent.parent` via a single `DEEPSEEK_DIR` base shared by both repo paths and the error message.

### Usage

```bash
# Run from examples/deepseek/ as documented in the README, after cloning
# DeepSeek-V3 (or DeepSeek-V3.2-Exp) into that directory:
torchrun --nproc-per-node 8 --master_port=12346 deepseek_v3/ptq.py \
    --model_path $DS_CKPT \
    --config DeepSeek-V3/inference/configs/config_671B.json \
    --quant_cfg NVFP4_DEFAULT_CFG \
    --output_path $FP4_QUANT_PATH
```

### Testing

- Confirmed against the repro path: with the file at `examples/deepseek/deepseek_v3/ptq.py` and the repos cloned into `examples/deepseek/`, `Path(__file__).resolve().parent.parent` now points at `examples/deepseek/` so `DeepSeek-V3/inference` resolves correctly.
- Verified the sibling `examples/deepseek/deepseek_v4/` does not share the bug (it takes an explicit `--dsv4_inference_dir` argument instead).
- `pre-commit` clean.

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A (one-line path fix in an example script that requires the DeepSeek repos + multi-GPU checkpoint to exercise)
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A (bug is in a 0.45-cycle example, not a regression from a released version)
- Did you get Claude approval on this PR?: ❌ (not yet run)

### Additional Information

nvbug 6311147 / OMNIML-5103.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Bug Fixes**
  * Improved path resolution in the example script to more reliably locate the required inference repository.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
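The off-by-one-directory lookup in #1702 is easy to see with pure path arithmetic. The sketch below uses `PurePosixPath` so that nothing touches the filesystem; the `/repo` root is a placeholder chosen for illustration, and the variable names are not taken from `ptq.py`.

```python
from pathlib import PurePosixPath

# Layout described in the PR; "/repo" is a placeholder root for this sketch.
script = PurePosixPath("/repo/examples/deepseek/deepseek_v3/ptq.py")

old_base = script.parent          # .../examples/deepseek/deepseek_v3
new_base = script.parent.parent   # .../examples/deepseek (where the README clones the repos)

old_lookup = old_base / "DeepSeek-V3" / "inference"
new_lookup = new_base / "DeepSeek-V3" / "inference"

print(old_lookup)
print(new_lookup)
```

The first path points inside `deepseek_v3/`, one level too deep; the second matches the directory the README clones into.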
### What does this PR do?

Type of change: Bug fix

Extends the calibration/memory-probe `use_cache` guard to Step 3.7-style nested text configs. Step 3.7 remote code reads the language config under `model.config.text_config` directly and raises `AttributeError` when `use_cache` is absent during PTQ calibration with Transformers >5. This keeps the existing Step 3.5 behavior and applies the same temporary set/restore logic to the nested text config.

### Usage

No API change. PTQ calibration continues to use the existing forward-loop path.

### Testing

- `pre-commit run ruff-format --files modelopt/torch/utils/dataset_utils.py tests/unit/torch/utils/test_dataset_utils.py`
- `pre-commit run ruff-check --files modelopt/torch/utils/dataset_utils.py tests/unit/torch/utils/test_dataset_utils.py`
- `python -m py_compile modelopt/torch/utils/dataset_utils.py tests/unit/torch/utils/test_dataset_utils.py`
- `python -m pytest tests/unit/torch/utils/test_dataset_utils.py -k "disable_use_cache or iter_use_cache_configs or forward_loop_runs_under_disabled" -vv`

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A
- Did you get Claude approval on this PR?: N/A

### Additional Information

This is separate from PR #1693. Step 3.7 needs both fixes if both failure paths are exercised: this PR fixes PTQ calibration-time `use_cache` handling, while PR #1693 fixes exported config `layer_types` metadata for deployment config loading.

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Bug Fixes**
  * Improved handling of cache flags stored in nested model configuration objects: cache is reliably disabled during dataset operations and restored or removed afterward.
* **Tests**
  * Added unit tests covering nested-config disabling, restoration/removal of cache flags post-operation, and deduplication when nested configs reference the same object.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
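The temporary set/restore behavior described for the nested `use_cache` guard can be sketched with a small context manager. This is an illustration only: `iter_configs` and `use_cache_disabled` are hypothetical names, the model is a `SimpleNamespace` stand-in, and the actual helpers in `modelopt/torch/utils/dataset_utils.py` may be structured differently.

```python
from contextlib import contextmanager
from types import SimpleNamespace

_MISSING = object()  # sentinel: the attribute did not exist before we set it


def iter_configs(model):
    """Yield the top-level config and a nested text_config, each at most once."""
    seen = set()
    top = getattr(model, "config", None)
    for cfg in (top, getattr(top, "text_config", None)):
        if cfg is not None and id(cfg) not in seen:
            seen.add(id(cfg))
            yield cfg


@contextmanager
def use_cache_disabled(model):
    """Temporarily set use_cache=False, then restore or remove it on exit."""
    saved = [(cfg, getattr(cfg, "use_cache", _MISSING)) for cfg in iter_configs(model)]
    for cfg, _ in saved:
        cfg.use_cache = False
    try:
        yield model
    finally:
        for cfg, old in saved:
            if old is _MISSING:
                delattr(cfg, "use_cache")   # attribute was absent: remove it again
            else:
                cfg.use_cache = old         # attribute existed: restore its value


text_cfg = SimpleNamespace()  # nested config with no use_cache attribute at all
model = SimpleNamespace(config=SimpleNamespace(use_cache=True, text_config=text_cfg))

with use_cache_disabled(model):
    inside = (model.config.use_cache, model.config.text_config.use_cache)

print(inside, model.config.use_cache, hasattr(text_cfg, "use_cache"))
```

Tracking configs by `id` means a nested config that is the same object as the top-level one is saved and restored only once.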
Fixes #1658

Type of change: Bug fix, documentation

This PR updates the Puzzletron dataset preparation flow to use the already published prebuilt dataset `nvidia/Puzzle-KD-Nemotron-Post-Training-Dataset-v2` by default, avoiding the need to download the full raw `nvidia/Nemotron-Post-Training-Dataset-v2` dataset (~136 GB) just to filter it down to the same ~2.6 GB result.

Changes included:

- Add `PREBUILT_KD_DATASET` constant in `prepare_dataset.py`
- Short-circuit dataset preparation when `dataset_name` matches the prebuilt dataset, loading it directly and skipping the download + filtering pipeline
- Update 8 Puzzletron example configs to use the prebuilt dataset path by default
- Update the Puzzletron README to document the default ~3 GB path and clarify that the raw ~136 GB path is still available if users want to reproduce preprocessing

Default lightweight path:

```bash
python -m modelopt.torch.puzzletron.dataset.prepare_dataset \
    --dataset_name nvidia/Puzzle-KD-Nemotron-Post-Training-Dataset-v2 \
    --output_dir path/to/Puzzle-KD-Nemotron-Post-Training-Dataset-v2
```

Raw dataset path (existing behavior, still supported):

```bash
python -m modelopt.torch.puzzletron.dataset.prepare_dataset \
    --dataset_name nvidia/Nemotron-Post-Training-Dataset-v2 \
    --output_dir path/to/Nemotron-Post-Training-Dataset-v2
```

- Ran `pre-commit run --all-files`
  - Most hooks passed successfully
  - Local pre-commit `mypy` reported unrelated existing errors in:
    - `modelopt/torch/opt/config_loader.py`
    - `modelopt/recipe/loader.py`
- Verified this change separately with a local mock-based test:
  - prebuilt dataset path correctly loads and saves directly
  - original raw dataset path remains untouched

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A
- Did you get Claude approval on this PR?: N/A

This change preserves the original raw-dataset workflow for users who explicitly want to regenerate the filtered dataset from scratch, while making the default example flow much lighter and easier to use.

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

* **Documentation**
  * Updated setup instructions to use a prebuilt, optimized dataset by default, simplifying the model compression workflow.
* **Chores**
  * Updated model compression configurations across multiple examples to use the prebuilt dataset.
  * Enhanced dataset preparation to support prebuilt dataset handling for more efficient setup.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Sabari07 <sabursd18@gmail.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Fixes the OOM (CPU RAM) issue reported in #1681.

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Bug Fixes**
  * Optimized memory management during model validation operations. Explicit resource cleanup procedures are now performed after each solution validation, preventing memory accumulation and eliminating out-of-memory errors during extended validation workflows.
* **Configuration**
  * Updated default validation dataset configuration setting.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
📝 Walkthrough
This PR adds new Alpamayo and Qwen-Image example workflows, updates LLM quantization and sparse-attention export paths, expands recipe and tutorial documentation, improves dataset and Puzzletron utilities, fixes ONNX handling, and adjusts CI workflows, caching, and evaluation scripts.
Changes
CI and workflow updates
Alpamayo quantization example
LLM quantization and evaluation fixes
FastGen DMD2 diffusion stack
Sparse attention export and example updates
Chat-template utilities and Puzzletron dataset updates
Recipes and tutorial documentation
ONNX calibration and metadata fixes
Sequence Diagram(s)
sequenceDiagram
participant Config as DMD2 config
participant Recipe as DMD2DiffusionRecipe
participant Pipeline as DMDPipeline
participant Checkpoint as sidecar checkpoint
Config->>Recipe: load config and overrides
Recipe->>Pipeline: build student, teacher, fake_score, discriminator
Recipe->>Pipeline: run student or fake-score phase
Pipeline-->>Recipe: return phase losses
Recipe->>Checkpoint: save student and DMD2 sidecar state
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## release/0.45.0 #1734 +/- ##
==================================================
- Coverage 77.48% 76.81% -0.68%
==================================================
Files 489 504 +15
Lines 54415 55332 +917
==================================================
+ Hits 42165 42501 +336
- Misses 12250 12831 +581
Actionable comments posted: 18
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/llm_ptq/example_utils.py (1)
646-659: ⚠️ Potential issue | 🟠 Major

Fix `has_pack_quantized_config()` to handle object-style quantization configs like `get_original_hf_quant_method()` does.

The function calls `.get()` directly on `quantization_config` without checking whether it's a dict or object. In Transformers 4.56.0, quantization config objects (e.g., `Mxfp4Config`) do not support dict-style `.get()` access, only attribute access. This will raise `AttributeError` at runtime if `quantization_config` is an object instead of a dict.

The same file already demonstrates the correct pattern in `get_original_hf_quant_method()` (lines 546–548), which uses `isinstance(quant_cfg, dict)` to branch between `.get()` for dicts and `getattr()` for objects. Apply this same guard to `has_pack_quantized_config()` on lines 630 and 636.

Suggested fix

     def has_pack_quantized_config(config):
    +    def _cfg_get(qcfg, key, default=None):
    +        return qcfg.get(key, default) if isinstance(qcfg, dict) else getattr(qcfg, key, default)
    +
         # Check top-level quantization_config
         if hasattr(config, "quantization_config"):
    -        if config.quantization_config.get("format", None) == "pack-quantized":
    +        if _cfg_get(config.quantization_config, "format") == "pack-quantized":
                 return True
         # Check nested text_config.quantization_config (for multi-modal models like kimi k2.5)
         if hasattr(config, "text_config") and hasattr(
             config.text_config, "quantization_config"
         ):
    -        if config.text_config.quantization_config.get("format", None) == "pack-quantized":
    +        if _cfg_get(config.text_config.quantization_config, "format") == "pack-quantized":
                 return True
         return False

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/llm_ptq/example_utils.py` around lines 646 - 659, The `has_pack_quantized_config()` function calls `.get()` directly on `quantization_config` without checking whether it's a dict or an object, which causes AttributeError at runtime when the config is an object like `Mxfp4Config` that doesn't support dict-style access. Fix this by adopting the same pattern already used in `get_original_hf_quant_method()` at lines 546-548: add an `isinstance(quantization_config, dict)` check to branch between using `.get()` for dict-style configs and `getattr()` for object-style configs. Apply this guard to both `.get()` calls in `has_pack_quantized_config()` around lines 630 and 636.
🧹 Nitpick comments (1)
modelopt/torch/fastgen/plugins/__init__.py (1)
24-27: ⚡ Quick win

Define explicit `__all__` in package `__init__.py` before wildcard re-export.

This package re-exports plugin symbols but does not declare its own `__all__`. Add module-level `__all__` and extend it from `qwen_image.__all__` when the plugin import succeeds so the public surface stays explicit.

As per coding guidelines, “Define the public API with `__all__` at the top of each module and re-export via `from .module import *` in package `__init__.py` files.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/fastgen/plugins/__init__.py` around lines 24 - 27, The package __init__.py file in modelopt/torch/fastgen/plugins/ performs a wildcard import from qwen_image but does not define its own __all__ to explicitly declare the public API. Define a module-level __all__ variable (can be initialized as an empty list or with expected symbols), then within the import_plugin context block for qwen_image, extend __all__ to include the symbols from qwen_image.__all__ after the successful import. This ensures the public surface of the package remains explicit and follows the coding guidelines for defining public APIs.

Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/alpamayo/quantize.py`:
- Around line 650-652: The global torch.no_grad() context manager wrapping the
main() function call disables gradient computation globally, which prevents the
--quantize auto path from computing gradients needed by the auto_quantize() API
for gradient-based sensitivity scoring during the search phase. Remove the with
torch.no_grad(): wrapper and call main() directly without the context manager to
allow gradients to flow through the loss function when needed.
- Around line 461-467: The debug logging statements that extract GPU tensor
values to CPU scalars using .item() calls (on v_pred and v_target with
torch.isfinite, min, max, and abs().mean() operations) create unnecessary
CPU-GPU synchronization points in the optimization loop hot path. Either remove
these print statements entirely, or gate them behind a conditional debug flag
(such as if debug_logging:) that defaults to False so they do not execute during
normal operation. This will eliminate the synchronization overhead while
preserving the ability to enable detailed logging when needed for debugging.
In `@examples/deepseek/deepseek_v4/quantize_to_nvfp4.py`:
- Around line 329-335: The per-block amax computation uses
mxfp4_to_nvfp4_per_block_amax which internally recomputes k_max, but this may
differ from the shared k_max passed to the parent function, causing the
in-range/out-of-range classification to mismatch with the weight_scale_2
computed from the shared k_max. To fix this, either pass the shared k_max
parameter into the mxfp4_to_nvfp4_per_block_amax helper function so it uses the
correct shared value for both classification and scaling, or compute the
per_block_scale and in-range logic directly in this location using the shared
k_max instead of relying on the helper's internally recomputed value. This
ensures the block classification and weight_scale_2 derivation use the same
k_max reference.
In `@examples/diffusers/fastgen/dmd2_recipe.py`:
- Line 669: The torch.load calls at lines 669, 674, 687, and 696 in
dmd2_recipe.py use weights_only=False when loading checkpoints from the
user-supplied restore_from parameter, creating a Remote Code Execution risk if
checkpoints are untrusted or tampered with. For each of these four locations,
either change weights_only=False to weights_only=True to safely deserialize only
tensor data, or if weights_only=False is absolutely necessary for functionality,
add an inline comment explaining the security justification and request approval
from `@NVIDIA/modelopt-setup-codeowners`. The preferred approach is to switch all
four calls to weights_only=True unless there is a documented reason why the
model architecture requires full pickle deserialization.
In `@examples/diffusers/fastgen/export_diffusers_qwen_image.py`:
- Around line 49-52: The example usage in the export_diffusers_qwen_image.py
script shows `--base_pipeline_path Qwen/Qwen-Image`, but the export_diffusers()
function requires a local directory path and will fail with a non-directory
input like a model identifier. Replace the Qwen/Qwen-Image reference in the
example usage (around line 51) with a local snapshot directory path (e.g.,
/path/to/local/qwen_image_base or similar) to accurately reflect the expected
input format.
In `@examples/diffusers/fastgen/inference_dmd2_qwen_image.py`:
- Line 483: The os.makedirs call at line 483 crashes when output_png is a bare
filename because os.path.dirname returns an empty string. Before calling
os.makedirs on the dirname of output_png, guard against empty parent paths by
checking if the dirname is empty and using "." (current directory) as a fallback
when it is. This ensures the code handles both full paths and bare filenames
gracefully.
- Line 153: The torch.load call loading the EMA checkpoint with
weights_only=False enables unsafe pickle deserialization, creating a
code-execution risk for malicious files. Since the ema_path parameter is
caller-supplied without documented safety justification, and EMA state contains
only model weights which can be safely deserialized, change weights_only=False
to weights_only=True in the torch.load call on line 153 to disable pickle
deserialization and load only tensor data safely.
- Around line 145-146: The directory validation at lines 145-146 using
os.path.isdir rejects HuggingFace model IDs like the documented CLI default
"Qwen/Qwen-Image" at line 505, preventing diffusers from resolving the model.
Either remove the os.path.isdir check and let diffusers handle both local paths
and model IDs, or update the CLI default and help text at line 505 to require a
local snapshot path instead. Additionally, add an inline comment at line 153
where torch.load is called with weights_only=False explaining that it is safe
because the EMA checkpoint is internally-generated and trusted, not
user-supplied, to satisfy security guidelines.
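The bare-filename crash flagged for line 483 of `inference_dmd2_qwen_image.py` comes from `os.path.dirname` returning an empty string, which `os.makedirs` rejects. A minimal sketch of the suggested guard follows; `ensure_parent_dir` is a hypothetical helper name used only for this illustration.

```python
import os
import tempfile


def ensure_parent_dir(path):
    """Create the parent directory of `path`, treating a bare filename as the current directory."""
    parent = os.path.dirname(path) or "."
    os.makedirs(parent, exist_ok=True)
    return parent


# A bare filename has an empty dirname, which os.makedirs("") would reject.
bare_parent = ensure_parent_dir("out.png")

with tempfile.TemporaryDirectory() as tmp:
    nested_parent = ensure_parent_dir(os.path.join(tmp, "runs", "a", "out.png"))
    nested_exists = os.path.isdir(nested_parent)

print(bare_parent, nested_exists)
```

The `or "."` fallback keeps the call site to a single line while handling both full paths and bare filenames.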
In `@examples/llm_ptq/scripts/huggingface_example.sh`:
- Around line 296-305: Variable expansions in the python command invocations are
unquoted, which can cause argument splitting or glob expansion if the variables
contain spaces or special characters. In the mmlu.py command starting at line
296, wrap all variable expansions including $MODEL_ABS_PATH, $SAVE_PATH,
$MMLU_DATA_PATH, and $mmlu_flags in double quotes to ensure they are treated as
single arguments. Apply the same quoting fix to the corresponding command
invocation at lines 320-323 for consistency, wrapping all variable expansions in
that location with double quotes as well.
In `@modelopt/deploy/llm/generate.py`:
- Around line 291-295: Replace the assert statement in the
generate_context_logits() method that validates enable_kv_cache_reuse with an
explicit if statement that raises a ValueError. The current assert can be
stripped when Python runs with optimization flags (like -O), which would
silently allow incorrect behavior in this public API method. Change the
condition to check if self._enable_kv_cache_reuse is True, and if so, raise a
ValueError with the same descriptive error message that currently appears in the
assert.
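The concern about `assert` in a public method can be demonstrated without launching a separate interpreter: `compile(..., optimize=1)` applies the same assert-stripping as `python -O`. In the sketch below the function names and the error message are placeholders, not the actual code or message in `modelopt/deploy/llm/generate.py`.

```python
SOURCE = '''
def guarded_by_assert(reuse_enabled):
    assert not reuse_enabled, "placeholder message: KV-cache reuse must be disabled"
    return "ran"
'''


def load(optimize):
    """Compile SOURCE at the given optimization level and return the function."""
    namespace = {}
    exec(compile(SOURCE, "<sketch>", "exec", optimize=optimize), namespace)
    return namespace["guarded_by_assert"]


def guarded_by_raise(reuse_enabled):
    # Explicit check: still enforced when asserts are stripped.
    if reuse_enabled:
        raise ValueError("placeholder message: KV-cache reuse must be disabled")
    return "ran"


try:
    load(optimize=0)(True)
    normal = "no error"
except AssertionError:
    normal = "AssertionError"

stripped = load(optimize=1)(True)  # assert removed, so the call falls through

try:
    guarded_by_raise(True)
    explicit = "no error"
except ValueError:
    explicit = "ValueError"

print(normal, stripped, explicit)
```

At optimization level 0 the assert fires, at level 1 the same call silently proceeds, and the explicit `raise` is unaffected by the optimization level. The same reasoning applies to any validator of external input.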
In `@modelopt/torch/fastgen/__init__.py`:
- Around line 57-68: The package API surface is not explicitly curated via
`__all__` in the two `__init__.py` files, making API drift likely. In
modelopt/torch/fastgen/__init__.py at lines 57-68, add an explicit `__all__`
list that aggregates all exported names from the wildcard imports (config, ema,
factory, loader, methods.dmd, pipeline modules) and explicitly includes the
module-level re-exports (flow_matching, losses, utils, plugins) to define the
curated public API surface. In modelopt/torch/fastgen/methods/__init__.py at
line 18, after the wildcard re-export from .dmd, add a line that imports __all__
from the dmd module and assigns it as __all__ to explicitly pin the exported
surface, ensuring both files follow the coding guideline of declaring public
surfaces with explicit `__all__` declarations.
In `@modelopt/torch/fastgen/config.py`:
- Around line 94-103: The _check_bounds validator method uses assert statements
to validate external input from YAML configuration, which is unsafe because
asserts can be disabled in optimized Python runs (with -O flag), allowing
invalid configuration to pass validation. Replace all four assert statements
with explicit ValueError raises instead, maintaining the same validation logic
and error messages but using the raise ValueError syntax to guarantee validation
always occurs regardless of Python optimization settings.
In `@modelopt/torch/fastgen/discriminators.py`:
- Around line 92-94: The feature_indices filtering at line 92 in the __init__
method only checks the upper bound (i < num_blocks) but allows negative indices
and can result in an empty set, causing torch.cat to fail later at line 136.
Replace the current filter condition with proper validation that enforces 0 <= i
< num_blocks for each index. Additionally, add a check after filtering to raise
an informative error immediately if feature_indices becomes empty, rather than
allowing silent failure downstream. This validates the input once at the
interface boundary as per coding guidelines.
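A possible shape for the two-sided bounds check and the empty-set guard is sketched below. The function name, the choice to deduplicate and sort, and the error wording are assumptions for illustration; the real discriminator constructor may organize this differently.

```python
def select_feature_indices(feature_indices, num_blocks):
    """Keep indices in [0, num_blocks) and fail fast if none remain."""
    kept = sorted({i for i in feature_indices if 0 <= i < num_blocks})
    if not kept:
        raise ValueError(
            f"no valid feature_indices in {list(feature_indices)} for num_blocks={num_blocks}"
        )
    return kept


valid = select_feature_indices([-1, 2, 5, 40], num_blocks=10)

try:
    select_feature_indices([-1, 40], num_blocks=10)
    empty_outcome = "no error"
except ValueError:
    empty_outcome = "ValueError"

print(valid, empty_outcome)
```

Raising at construction time surfaces the misconfiguration where it originates instead of at a later concatenation over an empty list.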
In `@modelopt/torch/fastgen/ema.py`:
- Around line 127-129: The EMA shadow initialization and reset paths do not
respect the local_shard mode and unnecessarily call _gather_full(), which
triggers expensive all-gathers and memory spikes. Add a conditional check for
config.mode == "local_shard" before calling _gather_full() in the shadow
initialization and reset logic. When in local_shard mode, use the local shard
directly (the parameter p itself or a detached copy) instead of gathering the
full tensor across all ranks. Apply this fix at all locations where
_gather_full() is called during shadow initialization and reset operations,
including the code block around self._shadow[clean] assignment and any similar
shadow update paths.
In `@modelopt/torch/fastgen/flow_matching.py`:
- Around line 185-195: The _truncated_lognormal function creates all tensors on
CPU during sampling and only transfers the final result to the requested device,
causing unnecessary host/device traffic. Pass the device parameter directly to
all tensor creation calls including torch.tensor calls for log_min_t, log_max_t,
mean, and std tensors, as well as the torch.rand call for sampling u, to ensure
all intermediate computations happen on the target device from the start rather
than requiring a final .to() transfer.
In `@modelopt/torch/fastgen/loader.py`:
- Around line 74-93: The candidate paths in the _candidate_paths function are
being added in the wrong order. Currently, filesystem paths are appended before
built-in recipe paths, but the documented contract states built-in recipes
should be checked first. For both the string and Path branches of the
conditional logic, reverse the order of the candidate appends so that all
_BUILTIN_RECIPES_LIB.joinpath calls are executed before the corresponding Path
or direct filesystem path calls. This applies to all four
candidate.append/extend calls in the string branch and all four calls in the
Path branch.
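The effect of candidate ordering on resolution can be shown with a small stand-in. Everything here is hypothetical: `BUILTIN_ROOT`, the suffix list, and the `resolve` helper are invented for the sketch and do not mirror the actual `_candidate_paths` implementation.

```python
from pathlib import PurePosixPath

BUILTIN_ROOT = PurePosixPath("/pkg/recipes")  # hypothetical packaged recipe root


def candidate_paths(name, suffixes=("", ".yaml")):
    """Return lookup candidates with built-in recipes ahead of filesystem paths."""
    builtin = [BUILTIN_ROOT / f"{name}{suffix}" for suffix in suffixes]
    filesystem = [PurePosixPath(f"{name}{suffix}") for suffix in suffixes]
    return builtin + filesystem


def resolve(name, existing):
    """Return the first candidate present in `existing`, or None."""
    return next((path for path in candidate_paths(name) if path in existing), None)


# Both a built-in recipe and a same-named local file "exist" in this toy setup.
existing = {PurePosixPath("/pkg/recipes/dmd2.yaml"), PurePosixPath("dmd2.yaml")}
print(resolve("dmd2", existing))
```

Because the first existing candidate wins, listing built-in paths first is what makes a built-in recipe take precedence over a same-named file in the working directory.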
In `@tests/examples/diffusers/sparsity/test_sparsity.py`:
- Around line 165-170: Move the imports currently at lines 165–170 (from
diffusers import AutoencoderKLWan and WanPipeline, the
modelopt.torch.sparsity.attention_sparsity import, the
modelopt.torch.export.export_hf_checkpoint import, and the SparseAttentionModule
import) from inside the test function to the module scope at the top of the file
with the other imports. If any of these imports require deferred loading due to
optional dependencies or circular import concerns, keep them in the function and
add a brief comment explaining why.
In `@tests/unit/torch/utils/test_dataset_utils.py`:
- Around line 38-59: The test reuses the same messages variable after it has
been processed by the first prepare_messages_for_chat_template call, which may
have modified the input. For the second assertion that tests
normalize_tool_calls=False, create a fresh copy of the messages input with the
identical structure instead of reusing the modified messages variable. This
ensures the non-normalizing code path is tested with unmodified input and can
properly detect regressions in how it preserves content when normalization is
disabled.
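The risk of reusing an input that an earlier call may have mutated can be illustrated generically. In the sketch below `normalize_in_place` is a hypothetical stand-in for any helper that modifies its argument; it is not the real `prepare_messages_for_chat_template`, whose mutation behavior is not confirmed here.

```python
import copy


def normalize_in_place(messages):
    """Hypothetical stand-in for a helper that mutates its input."""
    for message in messages:
        message["content"] = message["content"].strip()
    return messages


def make_messages():
    # Build fresh input for every call so no assertion sees another call's side effects.
    return [{"role": "user", "content": "  hi  "}]


original = make_messages()
snapshot = copy.deepcopy(original)

normalize_in_place(original)  # the first call mutates `original`
reused_is_pristine = original == snapshot
fresh_is_pristine = make_messages() == snapshot

print(reused_is_pristine, fresh_is_pristine)
```

A small factory such as `make_messages()` (or a `copy.deepcopy` of a pristine fixture) gives the second assertion unmodified input, so the non-normalizing path is tested on what it would actually receive.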
---
Outside diff comments:
In `@examples/llm_ptq/example_utils.py`:
- Around line 646-659: The `has_pack_quantized_config()` function calls `.get()`
directly on `quantization_config` without checking whether it's a dict or an
object, which causes AttributeError at runtime when the config is an object like
`Mxfp4Config` that doesn't support dict-style access. Fix this by adopting the
same pattern already used in `get_original_hf_quant_method()` at lines 546-548:
add an `isinstance(quantization_config, dict)` check to branch between using
`.get()` for dict-style configs and `getattr()` for object-style configs. Apply
this guard to both `.get()` calls in `has_pack_quantized_config()` around lines
630 and 636.
---
Nitpick comments:
In `@modelopt/torch/fastgen/plugins/__init__.py`:
- Around line 24-27: The package __init__.py file in
modelopt/torch/fastgen/plugins/ performs a wildcard import from qwen_image but
does not define its own __all__ to explicitly declare the public API. Define a
module-level __all__ variable (can be initialized as an empty list or with
expected symbols), then within the import_plugin context block for qwen_image,
extend __all__ to include the symbols from qwen_image.__all__ after the
successful import. This ensures the public surface of the package remains
explicit and follows the coding guidelines for defining public APIs.
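One way to picture the suggestion is the sketch below. It is built on assumptions: `contextlib.suppress(ImportError)` stands in for the repository's `import_plugin` context manager, the helper `export_from` and the package object are hypothetical, and the standard-library `json` module plays the role of `qwen_image` so the example runs anywhere.

```python
import contextlib
import importlib
import types


def export_from(module_name: str, namespace: dict, exported: list) -> None:
    """Re-export a plugin's public names and record them in the package __all__."""
    # Stand-in for import_plugin: skip silently when the optional module is missing.
    with contextlib.suppress(ImportError):
        module = importlib.import_module(module_name)
        names = list(getattr(module, "__all__", []))
        namespace.update({name: getattr(module, name) for name in names})
        # Extend only after the import succeeded, so __all__ never lists absent names.
        exported.extend(names)


# Hypothetical package object standing in for modelopt/torch/fastgen/plugins/__init__.py.
package = types.ModuleType("hypothetical_plugins")
package.__all__ = []

export_from("json", vars(package), package.__all__)
export_from("module_that_is_not_installed", vars(package), package.__all__)

print(sorted(package.__all__))
```

With this shape the public surface is always the union of the plugins that actually imported, and a missing optional dependency leaves `__all__` untouched.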
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 82f25edc-4ce3-4c7d-9a1c-2b2541d741f9
⛔ Files ignored due to path filters (2)
- examples/alpamayo/0417_16rows_train_set_for_calibration_25.10.parquet is excluded by !**/*.parquet
- examples/megatron_bridge/tutorials/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/figures/learning_curves.png is excluded by !**/*.png
📒 Files selected for processing (122)
- .github/actions/cache-extensions/action.yml
- .github/workflows/_example_tests_runner.yml
- .github/workflows/example_tests.yml
- .github/workflows/gpu_tests.yml
- .github/workflows/unit_tests.yml
- CHANGELOG.rst
- README.md
- docs/source/guides/10_recipes.rst
- examples/alpamayo/README.md
- examples/alpamayo/quantize.py
- examples/deepseek/README.md
- examples/deepseek/deepseek_v3/ptq.py
- examples/deepseek/deepseek_v4/quantize_to_nvfp4.py
- examples/diffusers/README.md
- examples/diffusers/fastgen/README.md
- examples/diffusers/fastgen/configs/dmd2_qwen_image.yaml
- examples/diffusers/fastgen/configs/dmd2_qwen_image_smoke.yaml
- examples/diffusers/fastgen/dmd2_finetune.py
- examples/diffusers/fastgen/dmd2_recipe.py
- examples/diffusers/fastgen/export_diffusers_qwen_image.py
- examples/diffusers/fastgen/inference_dmd2_qwen_image.py
- examples/diffusers/fastgen/requirements.txt
- examples/diffusers/sparsity/README.md
- examples/diffusers/sparsity/wan22_skip_softmax.py
- examples/llm_eval/lm_eval_tensorrt_llm.py
- examples/llm_eval/mmlu.py
- examples/llm_eval/run_simple_eval.sh
- examples/llm_ptq/cast_mxfp4_to_nvfp4.py
- examples/llm_ptq/example_utils.py
- examples/llm_ptq/hf_ptq.py
- examples/llm_ptq/run_tensorrt_llm.py
- examples/llm_ptq/scripts/huggingface_example.sh
- examples/llm_ptq/scripts/parser.sh
- examples/llm_qat/ARGUMENTS.md
- examples/megatron_bridge/README.md
- examples/megatron_bridge/requirements.txt
- examples/megatron_bridge/tutorials/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/ABLATIONS.md
- examples/megatron_bridge/tutorials/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/README.md
- examples/megatron_bridge/tutorials/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/nemo_evaluator.yaml
- examples/megatron_bridge/tutorials/README.md
- examples/pruning/README.md
- examples/pruning/minitron/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/README.md
- examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/README.md
- examples/pruning/minitron/NVIDIA-Nemotron-Nano-9B-v2/nemo_evaluator.yaml
- examples/pruning/minitron_vs_puzzletron/README.md
- examples/puzzletron/README.md
- examples/puzzletron/configs/gptoss-20b_remove_experts_memory/gptoss-20b_remove_experts_memory.yaml
- examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml
- examples/puzzletron/configs/llama-3_2-3B_pruneffn_memory/llama-3_2-3B_pruneffn_memory.yaml
- examples/puzzletron/configs/mistral-small-24b-instruct-2501_pruneffn_memory/mistral-small-24b-instruct-2501_pruneffn_memory.yaml
- examples/puzzletron/configs/nemotron-nano-12b-v2/nemotron_nano_12b_v2_pruneffn_memory.yaml
- examples/puzzletron/configs/qwen2_5_7b_instruct_pruneffn_memory/qwen2_5_7b_instruct_pruneffn_memory.yaml
- examples/puzzletron/configs/qwen3-8b_pruneffn_memory/qwen3_8b_pruneffn_memory.yaml
- examples/specdec_bench/specdec_bench/models/vllm.py
- modelopt/deploy/llm/generate.py
- modelopt/onnx/quantization/ort_patching.py
- modelopt/onnx/utils.py
- modelopt/torch/export/unified_export_hf.py
- modelopt/torch/fastgen/__init__.py
- modelopt/torch/fastgen/config.py
- modelopt/torch/fastgen/discriminators.py
- modelopt/torch/fastgen/ema.py
- modelopt/torch/fastgen/factory.py
- modelopt/torch/fastgen/flow_matching.py
- modelopt/torch/fastgen/loader.py
- modelopt/torch/fastgen/losses.py
- modelopt/torch/fastgen/methods/__init__.py
- modelopt/torch/fastgen/methods/dmd.py
- modelopt/torch/fastgen/pipeline.py
- modelopt/torch/fastgen/plugins/__init__.py
- modelopt/torch/fastgen/plugins/qwen_image.py
- modelopt/torch/fastgen/utils.py
- modelopt/torch/puzzletron/dataset/prepare_dataset.py
- modelopt/torch/puzzletron/tools/validate_puzzle_with_multi_replacements.py
- modelopt/torch/quantization/plugins/huggingface.py
- modelopt/torch/quantization/plugins/transformers_trainer.py
- modelopt/torch/quantization/utils/numeric_utils.py
- modelopt/torch/sparsity/attention_sparsity/calibration/calibrate.py
- modelopt/torch/sparsity/attention_sparsity/calibration/calibrator.py
- modelopt/torch/sparsity/attention_sparsity/config.py
- modelopt/torch/sparsity/attention_sparsity/conversion.py
- modelopt/torch/sparsity/attention_sparsity/plugins/sparse_attn_config.py
- modelopt/torch/utils/dataset_utils.py
- modelopt/torch/utils/plugins/megatron_preprocess_data.py
- modelopt_recipes/README.md
- modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml
- modelopt_recipes/general/distillation/dmd2_qwen_image.yaml
- modelopt_recipes/general/ptq/nvfp4_experts_only-kv_fp8_cast.yaml
- modelopt_recipes/general/ptq/nvfp4_mlp_only-kv_fp8_cast.yaml
- modelopt_recipes/general/ptq/nvfp4_omlp_only-kv_fp8_cast.yaml
- modelopt_recipes/general/ptq/nvfp4_weight_only-kv_fp8_cast.yaml
- modelopt_recipes/models/Nemotron-H/Nemotron-3-Nano-4B/nvfp4_w4a16.yaml
- modelopt_recipes/ptq.md
- tests/_test_utils/torch/diffusers_models.py
- tests/_test_utils/torch/megatron/models.py
- tests/_test_utils/torch/transformers_models.py
- tests/examples/alpamayo/test_quantize.py
- tests/examples/diffusers/conftest.py
- tests/examples/diffusers/sparsity/test_sparsity.py
- tests/examples/diffusers_sparsity/test_sparsity.py
- tests/examples/llm_eval/test_llm_eval.py
- tests/examples/llm_ptq/test_cast_mxfp4_to_nvfp4.py
- tests/examples/llm_ptq/test_example_utils.py
- tests/gpu/onnx/quantization/test_plugin.py
- tests/gpu/onnx/test_ort_patching.py
- tests/gpu/torch/quantization/test_gpt_oss_mxfp4_nvfp4_cast_cuda.py
- tests/gpu/torch/utils/test_dataset_utils.py
- tests/gpu_megatron/torch/quantization/plugins/test_megatron.py
- tests/unit/recipe/test_loader.py
- tests/unit/recipe/test_presets.py
- tests/unit/torch/fastgen/conftest.py
- tests/unit/torch/fastgen/test_dmd_gradient_routing.py
- tests/unit/torch/fastgen/test_dmd_math.py
- tests/unit/torch/fastgen/test_dmd_pipeline_step.py
- tests/unit/torch/fastgen/test_hook_requirements.py
- tests/unit/torch/fastgen/test_pred_type_conversion.py
- tests/unit/torch/fastgen/test_qwen_image_plugin.py
- tests/unit/torch/quantization/plugins/test_huggingface.py
- tests/unit/torch/quantization/test_numeric_utils.py
- tests/unit/torch/sparsity/attention_sparsity/test_sparse_attention_conversion.py
- tests/unit/torch/sparsity/attention_sparsity/test_sparse_attn_config.py
- tests/unit/torch/utils/test_dataset_utils.py
💤 Files with no reviewable changes (2)
- examples/megatron_bridge/requirements.txt
- tests/examples/diffusers_sparsity/test_sparsity.py
print(
    f"[autoquant-fwd] v_pred: finite={torch.isfinite(v_pred).all().item()} "
    f"min={v_pred.min().item():.4g} max={v_pred.max().item():.4g} "
    f"abs_mean={v_pred.abs().mean().item():.4g} | "
    f"v_target: finite={torch.isfinite(v_target).all().item()} "
    f"min={v_target.min().item():.4g} max={v_target.max().item():.4g}"
)
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, locate the file and check its size
fd -a "quantize.py" examples/alpamayo/
# Then read the relevant section around lines 461-473
cat -n examples/alpamayo/quantize.py | sed -n '450,480p'
Repository: NVIDIA/Model-Optimizer
Length of output: 1827
🏁 Script executed:
# Check context around forward_step to see if there's any debug flag or conditional logging
cat -n examples/alpamayo/quantize.py | sed -n '440,480p'
Repository: NVIDIA/Model-Optimizer
Length of output: 2247
Remove debug logging or gate behind debug flag to avoid CPU-GPU sync in AutoQuantize hot path.
Lines 461–467 and 470–473 extract GPU tensor values with .item() for logging on every iteration of the optimization loop. This introduces synchronization points that can materially slow the search. Either remove the prints or gate them behind a conditional debug flag (e.g., if debug_logging:).
Per coding guidelines, avoid tensor-to-Python scalar extraction in GPU hot paths unless the CPU needs the value. Here these extractions are only for logging and are unnecessary for computation.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/alpamayo/quantize.py` around lines 461 - 467, The debug logging
statements that extract GPU tensor values to CPU scalars using .item() calls (on
v_pred and v_target with torch.isfinite, min, max, and abs().mean() operations)
create unnecessary CPU-GPU synchronization points in the optimization loop hot
path. Either remove these print statements entirely, or gate them behind a
conditional debug flag (such as if debug_logging:) that defaults to False so
they do not execute during normal operation. This will eliminate the
synchronization overhead while preserving the ability to enable detailed logging
when needed for debugging.
Source: Coding guidelines
if __name__ == "__main__":
    with torch.no_grad():
        main()
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify the root cause in this PR file and inspect auto-quantize internals for grad usage.
rg -n -C3 'with torch\.no_grad\(\):|auto_quantize\(|def forward_step|def loss_func|mse_loss' examples/alpamayo/quantize.py
rg -n -C3 'def auto_quantize|backward\(|autograd|grad' modelopt/torch/quantization
Repository: NVIDIA/Model-Optimizer
Length of output: 50379
Remove the global torch.no_grad() wrapper around main().
The global autograd disable at line 651 prevents gradient computation in the --quantize auto path (line 485), which relies on loss-driven search with explicit .backward() calls. The auto_quantize() API uses gradient-based sensitivity scoring by default and expects gradients to flow through the loss function during the search phase.
✅ Proposed fix
 if __name__ == "__main__":
-    with torch.no_grad():
-        main()
+    main()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/alpamayo/quantize.py` around lines 650 - 652, The global
torch.no_grad() context manager wrapping the main() function call disables
gradient computation globally, which prevents the --quantize auto path from
computing gradients needed by the auto_quantize() API for gradient-based
sensitivity scoring during the search phase. Remove the with torch.no_grad():
wrapper and call main() directly without the context manager to allow gradients
to flow through the loss function when needed.
per_block_amax = mxfp4_to_nvfp4_per_block_amax(blocks, e8m0)  # (out, nblk16) fp32

m = k_max - E4M3_KMAX
weight_scale_2 = torch.tensor(2.0**m, dtype=torch.float32, device=bf16.device).reshape(())
per_block_scale = (
    (per_block_amax / (E2M1_MAX * weight_scale_2))
    .clamp(min=2**-9, max=E4M3_MAX)
There was a problem hiding this comment.
Use the shared k_max when deriving OOR per-block amax.
mxfp4_to_nvfp4_per_block_amax(blocks, e8m0) recomputes k_max from this tensor, but w1/w3 cast uses the shared fused-GEMM1 k_max passed into _quantize_weight_nvfp4_lossless. A w1/w3 tensor with a lower local max can classify blocks as in-range here while weight_scale_2 is based on the larger shared max, so the OOR fallback no longer matches the scale actually emitted. Thread the shared k_max into the per-block helper or compute in_range/data-derived fallback locally from the same k_max.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/deepseek/deepseek_v4/quantize_to_nvfp4.py` around lines 329 - 335,
The per-block amax computation uses mxfp4_to_nvfp4_per_block_amax which
internally recomputes k_max, but this may differ from the shared k_max passed to
the parent function, causing the in-range/out-of-range classification to
mismatch with the weight_scale_2 computed from the shared k_max. To fix this,
either pass the shared k_max parameter into the mxfp4_to_nvfp4_per_block_amax
helper function so it uses the correct shared value for both classification and
scaling, or compute the per_block_scale and in-range logic directly in this
location using the shared k_max instead of relying on the helper's internally
recomputed value. This ensures the block classification and weight_scale_2
derivation use the same k_max reference.
)

if os.path.isfile(ema_path) and self._dmd_pipeline.ema is not None:
    ema_state = torch.load(ema_path, map_location="cpu", weights_only=False)
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, find and examine the file
find . -name "dmd2_recipe.py" -type f
Repository: NVIDIA/Model-Optimizer
Length of output: 110
🏁 Script executed:
# Check the specific lines mentioned
head -n 700 examples/diffusers/fastgen/dmd2_recipe.py | tail -n 50
Repository: NVIDIA/Model-Optimizer
Length of output: 2721
🏁 Script executed:
# Get broader context around the torch.load calls
sed -n '650,710p' examples/diffusers/fastgen/dmd2_recipe.py
Repository: NVIDIA/Model-Optimizer
Length of output: 3251
🏁 Script executed:
# Check if there are inline comments justifying weights_only=False
grep -n "weights_only" examples/diffusers/fastgen/dmd2_recipe.py
Repository: NVIDIA/Model-Optimizer
Length of output: 440
🏁 Script executed:
# Verify file location and type
ls -la examples/diffusers/fastgen/dmd2_recipe.py && wc -l examples/diffusers/fastgen/dmd2_recipe.py
Repository: NVIDIA/Model-Optimizer
Length of output: 205
🏁 Script executed:
# Check where these paths originate - find the function/method these lines are in
sed -n '600,670p' examples/diffusers/fastgen/dmd2_recipe.py | head -80
Repository: NVIDIA/Model-Optimizer
Length of output: 3510
🏁 Script executed:
# Look for where ema_path, state_path, disc_path, disc_opt_path are defined
grep -n "ema_path\|state_path\|disc_path\|disc_opt_path" examples/diffusers/fastgen/dmd2_recipe.py | head -20
Repository: NVIDIA/Model-Optimizer
Length of output: 1505
🏁 Script executed:
# Find the function definition and parameter origins
sed -n '600,670p' examples/diffusers/fastgen/dmd2_recipe.py
Repository: NVIDIA/Model-Optimizer
Length of output: 3510
🏁 Script executed:
# Check where ckpt_dir comes from and what the function signature is
sed -n '550,650p' examples/diffusers/fastgen/dmd2_recipe.py
Repository: NVIDIA/Model-Optimizer
Length of output: 5259
🏁 Script executed:
# Check how restore_from is set and where it comes from
grep -n "restore_from" examples/diffusers/fastgen/dmd2_recipe.py | head -30
Repository: NVIDIA/Model-Optimizer
Length of output: 1433
🏁 Script executed:
# Look at the load_checkpoint method to understand the flow
sed -n '700,800p' examples/diffusers/fastgen/dmd2_recipe.py
Repository: NVIDIA/Model-Optimizer
Length of output: 4337
🏁 Script executed:
# Check where load_checkpoint is called and what calls _restore_dmd_extras
grep -B5 -A5 "_restore_dmd_extras" examples/diffusers/fastgen/dmd2_recipe.py
Repository: NVIDIA/Model-Optimizer
Length of output: 1660
Replace unsafe torch.load(..., weights_only=False) with weights_only=True in checkpoint restore path.
These four torch.load calls at lines 669, 674, 687, 696 deserialize checkpoint files with pickle enabled. The checkpoint paths originate from user-supplied restore_from input, creating an RCE risk if a checkpoint is untrusted or tampered with. None of the calls include inline comments justifying why weights_only=False is necessary and safe. Switch to weights_only=True, or if False is genuinely unavoidable, add an inline trust rationale and request security exception approval from @NVIDIA/modelopt-setup-codeowners.
Suggested fix
- ema_state = torch.load(ema_path, map_location="cpu", weights_only=False)
+ ema_state = torch.load(ema_path, map_location="cpu", weights_only=True)
- state = torch.load(state_path, map_location="cpu", weights_only=False)
+ state = torch.load(state_path, map_location="cpu", weights_only=True)
- disc_state = torch.load(disc_path, map_location="cpu", weights_only=False)
+ disc_state = torch.load(disc_path, map_location="cpu", weights_only=True)
- disc_opt_state = torch.load(disc_opt_path, map_location="cpu", weights_only=False)
+ disc_opt_state = torch.load(disc_opt_path, map_location="cpu", weights_only=True)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- ema_state = torch.load(ema_path, map_location="cpu", weights_only=False)
+ ema_state = torch.load(ema_path, map_location="cpu", weights_only=True)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/diffusers/fastgen/dmd2_recipe.py` at line 669, The torch.load calls
at lines 669, 674, 687, and 696 in dmd2_recipe.py use weights_only=False when
loading checkpoints from the user-supplied restore_from parameter, creating a
Remote Code Execution risk if checkpoints are untrusted or tampered with. For
each of these four locations, either change weights_only=False to
weights_only=True to safely deserialize only tensor data, or if
weights_only=False is absolutely necessary for functionality, add an inline
comment explaining the security justification and request approval from
`@NVIDIA/modelopt-setup-codeowners`. The preferred approach is to switch all four
calls to weights_only=True unless there is a documented reason why the model
architecture requires full pickle deserialization.
Source: Coding guidelines
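The risk behind this finding can be illustrated with the standard library alone. The sketch below is not how `torch.load` is implemented; it only shows, under that stated assumption, why unrestricted pickle loading can run code and how an allow-list unpickler (the pattern from the Python `pickle` documentation) refuses such payloads. The class names are hypothetical, and `os.getcwd` is used as a harmless stand-in for an attacker-chosen callable.

```python
import io
import os
import pickle


class HostilePayload:
    """Pickles as 'call os.getcwd()' instead of as a plain data object."""

    def __reduce__(self):
        # A real attack would name a dangerous callable here; getcwd is harmless.
        return (os.getcwd, ())


class NoGlobalsUnpickler(pickle.Unpickler):
    """Rejects every global lookup, so only primitive containers can load."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(blob: bytes):
    return NoGlobalsUnpickler(io.BytesIO(blob)).load()


hostile = pickle.dumps(HostilePayload())
plain = pickle.dumps({"step": 500, "ema_decay": 0.999})

# Unrestricted loading executes the embedded call and returns its result.
print(type(pickle.loads(hostile)).__name__)

# Plain containers need no globals, so the restricted loader accepts them.
print(restricted_loads(plain))

try:
    restricted_loads(hostile)
except pickle.UnpicklingError as err:
    print("rejected:", err)
```

The same reasoning motivates preferring `weights_only=True` for checkpoints whose origin is a user-supplied path.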
python export_diffusers_qwen_image.py \\
--student_path /path/to/checkpoint/epoch_0_step_500/model/consolidated \\
--base_pipeline_path Qwen/Qwen-Image \\
--output_dir /path/to/output/qwen_image_dmd2 \\
There was a problem hiding this comment.
Fix the --base_pipeline_path usage example.
The example on Line 51 uses Qwen/Qwen-Image, but export_diffusers() requires a local directory and fails on non-directory input. Please update the example/help text to a local snapshot path (or broaden input handling).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/diffusers/fastgen/export_diffusers_qwen_image.py` around lines 49 -
52, The example usage in the export_diffusers_qwen_image.py script shows
`--base_pipeline_path Qwen/Qwen-Image`, but the export_diffusers() function
requires a local directory path and will fail with a non-directory input like a
model identifier. Replace the Qwen/Qwen-Image reference in the example usage
(around line 51) with a local snapshot directory path (e.g.,
/path/to/local/qwen_image_base or similar) to accurately reflect the expected
input format.
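If the script keeps the local-directory requirement, one option is to reject other input at argument-parsing time with a clear message. The sketch below is an assumption about how that could look, not the script's actual parser: `existing_dir` and `build_parser` are hypothetical names, and only the `--base_pipeline_path` flag is taken from the example above.

```python
import argparse
import tempfile
from pathlib import Path


def existing_dir(value: str) -> Path:
    """argparse type that accepts only an existing local directory."""
    path = Path(value).expanduser()
    if not path.is_dir():
        raise argparse.ArgumentTypeError(
            f"{value!r} is not a local directory; download a snapshot of the "
            "base pipeline first and pass its path"
        )
    return path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="hypothetical export CLI sketch")
    parser.add_argument("--base_pipeline_path", type=existing_dir, required=True)
    return parser


# A temporary directory stands in for a downloaded pipeline snapshot.
with tempfile.TemporaryDirectory() as snapshot_dir:
    args = build_parser().parse_args(["--base_pipeline_path", snapshot_dir])
    print(args.base_pipeline_path)
```

Failing in the parser surfaces the problem before any model loading starts, which is cheaper than failing inside the export step.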
full = _gather_full(p.detach(), fsdp2=config.fsdp2)
target_dtype = _resolve_dtype(config.dtype, full.dtype)
self._shadow[clean] = copy.deepcopy(full).to(dtype=target_dtype)
There was a problem hiding this comment.
local_shard mode is bypassed during EMA shadow init/reset.
When config.mode == "local_shard", these paths still call _gather_full(...), which can trigger unnecessary all-gathers and memory spikes.
Suggested fix
- full = _gather_full(p.detach(), fsdp2=config.fsdp2)
+ full = (
+ _gather_full(p.detach(), fsdp2=config.fsdp2)
+ if config.mode == "full_tensor"
+ else (p.detach().to_local() if _is_distributed_tensor(p) else p.detach())
+ )
target_dtype = _resolve_dtype(config.dtype, full.dtype)
self._shadow[clean] = copy.deepcopy(full).to(dtype=target_dtype)
...
- live = _gather_full(p.detach(), fsdp2=self.config.fsdp2)
+ live = (
+ _gather_full(p.detach(), fsdp2=self.config.fsdp2)
+ if self.config.mode == "full_tensor"
+ else (p.detach().to_local() if _is_distributed_tensor(p) else p.detach())
+ )
shadow.copy_(live.to(device=shadow.device, dtype=shadow.dtype))
Also applies to: 252-253
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@modelopt/torch/fastgen/ema.py` around lines 127 - 129, The EMA shadow
initialization and reset paths do not respect the local_shard mode and
unnecessarily call _gather_full(), which triggers expensive all-gathers and
memory spikes. Add a conditional check for config.mode == "local_shard" before
calling _gather_full() in the shadow initialization and reset logic. When in
local_shard mode, use the local shard directly (the parameter p itself or a
detached copy) instead of gathering the full tensor across all ranks. Apply this
fix at all locations where _gather_full() is called during shadow initialization
and reset operations, including the code block around self._shadow[clean]
assignment and any similar shadow update paths.
log_min_t = torch.tensor(math.log(min_t), dtype=torch.float64)
log_max_t = torch.tensor(math.log(max_t), dtype=torch.float64)
normal = Normal(
    torch.tensor(mean, dtype=torch.float64),
    torch.tensor(std, dtype=torch.float64),
)
cdf_min = normal.cdf(log_min_t)
cdf_max = normal.cdf(log_max_t)
u = torch.rand(n, dtype=torch.float64) * (cdf_max - cdf_min) + cdf_min
t = normal.icdf(u).exp()
return t.to(device=device, dtype=dtype)
There was a problem hiding this comment.
_truncated_lognormal ignores the requested device during sampling.
This path samples on CPU and transfers later, which adds avoidable host/device traffic in training loops.
Suggested fix
- log_min_t = torch.tensor(math.log(min_t), dtype=torch.float64)
- log_max_t = torch.tensor(math.log(max_t), dtype=torch.float64)
+ log_min_t = torch.tensor(math.log(min_t), dtype=torch.float64, device=device)
+ log_max_t = torch.tensor(math.log(max_t), dtype=torch.float64, device=device)
normal = Normal(
- torch.tensor(mean, dtype=torch.float64),
- torch.tensor(std, dtype=torch.float64),
+ torch.tensor(mean, dtype=torch.float64, device=device),
+ torch.tensor(std, dtype=torch.float64, device=device),
)
cdf_min = normal.cdf(log_min_t)
cdf_max = normal.cdf(log_max_t)
- u = torch.rand(n, dtype=torch.float64) * (cdf_max - cdf_min) + cdf_min
+ u = torch.rand(n, dtype=torch.float64, device=device) * (cdf_max - cdf_min) + cdf_min
t = normal.icdf(u).exp()
return t.to(device=device, dtype=dtype)
As per coding guidelines, keep tensor work on the GPU and avoid unnecessary CPU-GPU syncs.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@modelopt/torch/fastgen/flow_matching.py` around lines 185 - 195, The
_truncated_lognormal function creates all tensors on CPU during sampling and
only transfers the final result to the requested device, causing unnecessary
host/device traffic. Pass the device parameter directly to all tensor creation
calls including torch.tensor calls for log_min_t, log_max_t, mean, and std
tensors, as well as the torch.rand call for sampling u, to ensure all
intermediate computations happen on the target device from the start rather than
requiring a final .to() transfer.
Source: Coding guidelines
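For readers unfamiliar with the sampling scheme in `_truncated_lognormal`, the inverse-CDF idea can be sketched in pure Python. This is a standard-library stand-in that uses `statistics.NormalDist` in place of `torch.distributions.Normal`; it does not address device placement, and the function name, the clamping steps, and the parameter values are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist


def truncated_lognormal(n, mean, std, min_t, max_t, rng=None):
    """Draw n samples of t with log(t) ~ Normal(mean, std), restricted to [min_t, max_t]."""
    rng = rng or random.Random()
    normal = NormalDist(mu=mean, sigma=std)
    # CDF values of the truncation bounds in log space.
    cdf_min = normal.cdf(math.log(min_t))
    cdf_max = normal.cdf(math.log(max_t))
    samples = []
    for _ in range(n):
        # A uniform draw restricted to [cdf_min, cdf_max) maps back into the bounds.
        u = cdf_min + rng.random() * (cdf_max - cdf_min)
        # inv_cdf requires 0 < u < 1; this guard is an addition for the sketch.
        u = min(max(u, 1e-12), 1.0 - 1e-12)
        t = math.exp(normal.inv_cdf(u))
        # Clamp away floating-point round-off at the boundaries.
        samples.append(min(max(t, min_t), max_t))
    return samples


# Illustrative parameters, not values from the repository.
draws = truncated_lognormal(5, mean=0.0, std=1.0, min_t=0.02, max_t=0.98, rng=random.Random(0))
print(draws)
```

Because every uniform draw lies between the two CDF values, no rejection loop is needed, which is what makes the scheme attractive inside a training step.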
# Normalize to string for suffix probing; keep Path/Traversable behavior otherwise.
if isinstance(config_file, str):
    base = config_file
    if base.endswith(_SUFFIXES):
        candidates.append(Path(base))
        candidates.append(_BUILTIN_RECIPES_LIB.joinpath(base))
    else:
        candidates.extend(Path(base + suffix) for suffix in _SUFFIXES)
        candidates.extend(_BUILTIN_RECIPES_LIB.joinpath(base + suffix) for suffix in _SUFFIXES)
elif isinstance(config_file, Path):
    if config_file.suffix in _SUFFIXES:
        candidates.append(config_file)
        if not config_file.is_absolute():
            candidates.append(_BUILTIN_RECIPES_LIB.joinpath(str(config_file)))
    else:
        candidates.extend(Path(str(config_file) + suffix) for suffix in _SUFFIXES)
        if not config_file.is_absolute():
            candidates.extend(
                _BUILTIN_RECIPES_LIB.joinpath(str(config_file) + suffix) for suffix in _SUFFIXES
            )
There was a problem hiding this comment.
Candidate resolution order is inverted vs the documented contract.
The module docs say built-in recipes are checked first, but _candidate_paths currently prioritizes filesystem paths.
Suggested fix
if isinstance(config_file, str):
base = config_file
if base.endswith(_SUFFIXES):
- candidates.append(Path(base))
- candidates.append(_BUILTIN_RECIPES_LIB.joinpath(base))
+ candidates.append(_BUILTIN_RECIPES_LIB.joinpath(base))
+ candidates.append(Path(base))
else:
- candidates.extend(Path(base + suffix) for suffix in _SUFFIXES)
candidates.extend(_BUILTIN_RECIPES_LIB.joinpath(base + suffix) for suffix in _SUFFIXES)
+ candidates.extend(Path(base + suffix) for suffix in _SUFFIXES)
elif isinstance(config_file, Path):
if config_file.suffix in _SUFFIXES:
- candidates.append(config_file)
if not config_file.is_absolute():
candidates.append(_BUILTIN_RECIPES_LIB.joinpath(str(config_file)))
+ candidates.append(config_file)
else:
- candidates.extend(Path(str(config_file) + suffix) for suffix in _SUFFIXES)
if not config_file.is_absolute():
candidates.extend(
_BUILTIN_RECIPES_LIB.joinpath(str(config_file) + suffix) for suffix in _SUFFIXES
)
+ candidates.extend(Path(str(config_file) + suffix) for suffix in _SUFFIXES)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@modelopt/torch/fastgen/loader.py` around lines 74 - 93, The candidate paths
in the _candidate_paths function are being added in the wrong order. Currently,
filesystem paths are appended before built-in recipe paths, but the documented
contract states built-in recipes should be checked first. For both the string
and Path branches of the conditional logic, reverse the order of the candidate
appends so that all _BUILTIN_RECIPES_LIB.joinpath calls are executed before the
corresponding Path or direct filesystem path calls. This applies to all four
candidate.append/extend calls in the string branch and all four calls in the
Path branch.
from diffusers import AutoencoderKLWan, WanPipeline

import modelopt.torch.sparsity.attention_sparsity as mtsa
from modelopt.torch.export import export_hf_checkpoint
from modelopt.torch.sparsity.attention_sparsity.sparse_attention import SparseAttentionModule
There was a problem hiding this comment.
Move imports to module scope.
At lines 165–170, imports are inside the test function without justification. Per guidelines, imports belong at file top so errors surface at collection time. Move them to module scope unless they require deferred loading (optional dependencies or circular imports) — in which case add a brief comment explaining why.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/examples/diffusers/sparsity/test_sparsity.py` around lines 165 - 170,
Move the imports currently at lines 165–170 (from diffusers import
AutoencoderKLWan and WanPipeline, the modelopt.torch.sparsity.attention_sparsity
import, the modelopt.torch.export.export_hf_checkpoint import, and the
SparseAttentionModule import) from inside the test function to the module scope
at the top of the file with the other imports. If any of these imports require
deferred loading due to optional dependencies or circular import concerns, keep
them in the function and add a brief comment explaining why.
def test_prepare_messages_for_chat_template():
    messages = [
        {
            "role": "assistant",
            "content": "answer",
            "reasoning_content": "think",
            "tool_calls": [
                {"function": {"name": "search", "arguments": '{"q": "x"}'}},
            ],
        },
    ]
    prepared = prepare_messages_for_chat_template(
        messages, reasoning_content="native", normalize_tool_calls=True
    )
    assert prepared[0]["reasoning_content"] == "think"
    assert prepared[0]["tool_calls"][0]["function"]["arguments"] == {"q": "x"}
    assert (
        prepare_messages_for_chat_template(
            messages, reasoning_content="native", normalize_tool_calls=False
        )
        is messages
    )
There was a problem hiding this comment.
Use fresh input for the non-normalizing branch assertion.
The second assertion reuses messages after the normalizing call, so it can miss regressions in the normalize_tool_calls=False path’s content preservation.
Suggested test adjustment
def test_prepare_messages_for_chat_template():
- messages = [
+ messages = [
{
"role": "assistant",
"content": "answer",
"reasoning_content": "think",
"tool_calls": [
{"function": {"name": "search", "arguments": '{"q": "x"}'}},
],
},
]
prepared = prepare_messages_for_chat_template(
messages, reasoning_content="native", normalize_tool_calls=True
)
assert prepared[0]["reasoning_content"] == "think"
assert prepared[0]["tool_calls"][0]["function"]["arguments"] == {"q": "x"}
+
+ raw_messages = [
+ {
+ "role": "assistant",
+ "content": "answer",
+ "reasoning_content": "think",
+ "tool_calls": [
+ {"function": {"name": "search", "arguments": '{"q": "x"}'}},
+ ],
+ },
+ ]
assert (
prepare_messages_for_chat_template(
- messages, reasoning_content="native", normalize_tool_calls=False
+ raw_messages, reasoning_content="native", normalize_tool_calls=False
)
- is messages
+ is raw_messages
)
As per coding guidelines, checked-in tests should protect expected behavior and regressions.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/torch/utils/test_dataset_utils.py` around lines 38 - 59, The test
reuses the same messages variable after it has been processed by the first
prepare_messages_for_chat_template call, which may have modified the input. For
the second assertion that tests normalize_tool_calls=False, create a fresh copy
of the messages input with the identical structure instead of reusing the
modified messages variable. This ensures the non-normalizing code path is tested
with unmodified input and can properly detect regressions in how it preserves
content when normalization is disabled.
Source: Coding guidelines
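The concern is easiest to see with a helper that does mutate its argument. In the sketch below, `normalize_in_place` is a hypothetical stand-in written to misbehave on purpose; it says nothing about how `prepare_messages_for_chat_template` actually treats its input. A small factory function gives each assertion its own untouched copy.

```python
import json


def make_messages():
    """Build a fresh copy of the test input on every call."""
    return [
        {
            "role": "assistant",
            "content": "answer",
            "tool_calls": [
                {"function": {"name": "search", "arguments": '{"q": "x"}'}},
            ],
        },
    ]


def normalize_in_place(messages):
    """Hypothetical helper that parses tool-call arguments by mutating its input."""
    for message in messages:
        for call in message.get("tool_calls", []):
            arguments = call["function"]["arguments"]
            if isinstance(arguments, str):
                call["function"]["arguments"] = json.loads(arguments)
    return messages


shared = make_messages()
normalize_in_place(shared)

# The reused list no longer holds raw string arguments...
print(type(shared[0]["tool_calls"][0]["function"]["arguments"]).__name__)
# ...while a fresh copy from the factory still does.
print(type(make_messages()[0]["tool_calls"][0]["function"]["arguments"]).__name__)
```

A factory avoids both the copy-paste duplication of the literal and any dependence on whether the first call left its input intact.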
Cherry-picked PRs
Summary by CodeRabbit
Release Notes
New Features
Documentation
Bug Fixes